Reading through the recently released RDNA 2 Instruction Set Architecture Reference Guide, I came across some interesting information about raytracing support in the new GPU architecture. A disclaimer: the document is a little light on specifics, so some of what follows is extrapolation and may not be accurate.

According to the diagram released of the new RDNA 2 Workgroup Processor (WGP), a new hardware unit, the Ray Accelerator, has been added to implement ray/box and ray/triangle intersection in hardware.

That unit can calculate 4 ray/box intersections or 1 ray/triangle intersection per clock cycle. It seems that each WGP has 2 of those units. As part of the ray/triangle intersection test, I imagine, the Ray Accelerator unit can also return the barycentric coordinates, which can be used to interpolate data across the triangle surface.
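To illustrate what the barycentric coordinates are good for, here is a minimal sketch of interpolating a per-vertex attribute at a hit point. The function name, the sample UVs and the weighting convention (which coordinate belongs to which vertex) are my assumptions; the ISA guide does not spell out the convention.

```python
def interpolate(attr0, attr1, attr2, u, v):
    """Interpolate a per-vertex attribute at barycentric coordinates (u, v).

    Assumed convention (not from the ISA guide): u weights vertex 1,
    v weights vertex 2 and (1 - u - v) weights vertex 0.
    """
    w = 1.0 - u - v
    return tuple(w * a + u * b + v * c for a, b, c in zip(attr0, attr1, attr2))


# Hypothetical per-vertex texture coordinates for one triangle.
uv0, uv1, uv2 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
hit_uv = interpolate(uv0, uv1, uv2, 0.25, 0.5)
```

The same function works for positions, normals or any other attribute stored per vertex, as long as the three tuples have the same length.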

A new instruction, **image_bvh_intersect_ray**, has been added to the ISA to expose the Ray Accelerator functionality. According to the ISA documentation this instruction works on both bounding boxes and triangles, depending on the input node type. The instruction is defined as follows:

image_bvh_intersect_ray vgpr_d[4], vgpr_a[11], sgpr_r[4] A16=1

The instruction inputs go through a set of vector and scalar registers: **vgpr_a**[] is the set of inputs, **vgpr_d**[] is the set of outputs and **sgpr_r**[] is the resource descriptor for the BVH tree stored in memory. The input data for the instruction are typical for a BVH traversal algorithm, namely an offset to the BVH node that stores either one or more bounding boxes or the triangle, the maximum ray length (ray_extent, useful to restrict the ray length for short range AO or point light shadows for example), the ray origin and ray direction (and inverse direction which I assume is used by the ray/box intersection test):

    vgpr_a[0]    = BVH node_pointer (uint32)
    vgpr_a[1]    = ray_extent (float32)
    vgpr_a[2-4]  = ray_origin.xyz (float32)
    vgpr_a[5-7]  = ray_dir.xyz (float32)
    vgpr_a[8-10] = ray_inv_dir.xyz (float32)
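To make the register layout concrete, here is a small CPU-side sketch that packs a ray into 11 dwords in the order listed above. The function name `pack_ray_inputs` is hypothetical, and the handling of zero direction components (an infinite inverse) is my assumption rather than something the ISA guide specifies.

```python
import math
import struct


def pack_ray_inputs(node_pointer, ray_extent, origin, direction):
    """Return 11 dwords laid out like vgpr_a[0..10] in the listing above."""
    # Assumption: a zero direction component maps to a signed infinity.
    inv_dir = tuple(
        1.0 / d if d != 0.0 else math.copysign(math.inf, d) for d in direction
    )
    raw = struct.pack(
        "<I10f", node_pointer, ray_extent, *origin, *direction, *inv_dir
    )
    # Reinterpret the bytes as raw 32-bit register values.
    return struct.unpack("<11I", raw)


dwords = pack_ray_inputs(5, 100.0, (0.0, 0.0, 0.0), (1.0, 2.0, 4.0))
```

Precomputing the inverse direction once per ray means every box test along the traversal can use multiplications instead of divisions.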

The data returned through the 4 VGPRs used for the outputs (vgpr_d) depend on the type of intersection test. If it is a ray/box intersection, the 4 registers hold the pointers (offsets) to the 4 BVH child nodes, sorted by distance (referred to as intersection time in the document), and the hit status. Using those node offsets and the hit status, the shader can then decide which node offset to pass to the next intersection instruction. If it is a ray/triangle test, the output registers hold the distance to the triangle and the triangle ID. This could potentially be used, for example, to index into a buffer with triangle data to retrieve positions, normals, uv coordinates etc.
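Here is a rough sketch of how a shader-side traversal loop could consume those outputs, written in Python for readability. Everything in it is hypothetical: `intersect` is a stand-in for the hardware instruction, the `INVALID` sentinel for children that were not hit is my assumption, and the tuple-based return format is an illustration, not the real register encoding.

```python
INVALID = 0xFFFFFFFF  # assumed sentinel for "child not hit"; not from the ISA guide


def trace_closest(root, intersect):
    """Stack-based closest-hit traversal.

    `intersect(node)` is a hypothetical stand-in for the hardware instruction.
    It returns ("box", [c0, c1, c2, c3]) with children sorted nearest first and
    INVALID in unused slots, or ("tri", t, triangle_id) with t=None on a miss.
    """
    stack = [root]
    best_t, best_tri = float("inf"), None
    while stack:
        kind, *result = intersect(stack.pop())
        if kind == "box":
            children = [c for c in result[0] if c != INVALID]
            # Push farthest first so the nearest child is popped next.
            stack.extend(reversed(children))
        else:
            t, tri_id = result
            if t is not None and t < best_t:
                best_t, best_tri = t, tri_id
    return best_t, best_tri


# Toy data: node 0 is a box whose three hit children are triangle leaves.
fake_results = {
    0: ("box", [2, 1, 3, INVALID]),
    1: ("tri", 7.5, 101),
    2: ("tri", 3.0, 102),
    3: ("tri", None, 103),
}
closest = trace_closest(0, fake_results.__getitem__)
```

A real traversal would presumably also shrink ray_extent to the closest hit found so far, so that boxes further away are rejected early; the sketch leaves that out for brevity.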

The resource descriptor stored in the sgpr_r[] registers points to the BVH tree stored in memory. It is interesting to note that the descriptor reserves 42 bits to store the number of nodes of the BVH tree, which allows for some pretty large BVH trees. I am not sure how large the actual BVH tree can really be though.

Finally, there is also a mode (A16=1) which allows compressing the ray direction/inverse direction into float16s instead of float32s to “shorten” the instruction and improve instruction cache utilisation. There is also a variation of the instruction called **image_bvh64_intersect_ray**, which allows for 64-bit pointers to BVH nodes for those extra large BVH trees.
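As a rough illustration of what the A16 mode saves, the sketch below packs the direction and inverse direction as float16 pairs. The assumption that two halves share one 32-bit register, the ordering of the components and the helper name `pack_half_pairs` are all mine; the exact register layout should be checked against the ISA guide.

```python
import struct


def pack_half_pairs(values):
    """Pack an even number of floats as float16, two per 32-bit dword."""
    raw = struct.pack(f"<{len(values)}e", *values)
    return struct.unpack(f"<{len(values) // 2}I", raw)


direction = (1.0, 2.0, 4.0)
inv_direction = (1.0, 0.5, 0.25)
packed = pack_half_pairs(direction + inv_direction)

# Input register count: node pointer + extent + origin + direction data.
registers_fp32 = 2 + 3 + len(direction) + len(inv_direction)
registers_a16 = 2 + 3 + len(packed)

# float16 has an 11-bit significand, so most values lose precision.
roundtrip = struct.unpack("<e", struct.pack("<e", 0.1))[0]
precision_error = abs(roundtrip - 0.1)
```

Under that packing assumption the input shrinks from 11 to 8 registers, at the cost of a less precise ray direction.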

It is worth noting that the instruction works with the base pointer of the BVH tree, stored in memory, and offsets to the BVH nodes. This implies that the shader probably does not have the option to store, for example, the node data in the Local Data Share (LDS) or in registers before kicking off the intersection test.

Understandably, the ISA reference document does not say much about the format of the BVH tree, or its nodes, or how it is created. It can be inferred from the return data of the intersection instruction that it is a BVH4 though, i.e. a BVH tree with 4 child nodes per node and probably one triangle in the leaf node.

BVH4 trees are used frequently for raytracing; Intel’s Embree, for example, supports that format for CPU raytracing to improve SIMD utilisation. In my toy engine I am using Embree to build BVH2 trees, as they seem better suited to compute shader based BVH traversal. Having support for 4 ray/box intersections per clock in hardware, though, makes BVH4 appealing for GPU raytracing as well.

The basis of all raytracing algorithms is the ray/axis aligned bounding box and ray/triangle intersection tests, and this is where most of the tree traversal time is spent. Accelerating this part of the raytracing algorithm, with support for 4 ray/box intersections or 1 ray/triangle intersection per clock, is expected to give large performance gains. As an example, the ray/box intersection I am using in my compute shader based raytracing demos is around 22 ALU instructions, so it will be great to get the equivalent of four times that amount of instructions in a single clock. It would also be great if this instruction was exposed in HLSL someday to allow the option for custom acceleration structures and traversal schemes.
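For reference, this is the kind of ray/box test being replaced: a minimal sketch of the classic slab test, which is also where the precomputed inverse direction comes in. It is a generic textbook version in Python, not the shader code from my demos and not a description of what the hardware does; the function name and the sample boxes are hypothetical.

```python
def ray_box_intersect(origin, inv_dir, box_min, box_max, ray_extent):
    """Slab test: return the entry distance, or None if the ray misses.

    Caveat: an infinite inv_dir component combined with an origin lying
    exactly on a slab plane produces NaN, which this sketch does not handle.
    """
    t_near, t_far = 0.0, ray_extent
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near if t_near <= t_far else None


origin = (0.0, 0.0, 0.0)
inv_dir = (1.0, 1.0, 1.0)  # inverse of the (unnormalised) direction (1, 1, 1)

# Four hypothetical child boxes, tested one after the other on the CPU.
boxes = [
    ((2.0, 2.0, 2.0), (4.0, 4.0, 4.0)),
    ((2.0, -4.0, 2.0), (4.0, -2.0, 4.0)),
    ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0)),
    ((50.0, 50.0, 50.0), (60.0, 60.0, 60.0)),
]
distances = [ray_box_intersect(origin, inv_dir, lo, hi, 10.0) for lo, hi in boxes]
# Indices of the boxes that were hit, nearest first.
hit_order = sorted(
    (i for i, t in enumerate(distances) if t is not None),
    key=lambda i: distances[i],
)
```

The last box is rejected only because of the ray_extent of 10, which shows how a short ray length prunes distant nodes.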