RDNA 2 hardware raytracing

Reading through the recently released RDNA 2 Instruction Set Architecture Reference Guide, I came across some interesting information about raytracing support in the new GPU architecture. Disclaimer: the document is a little light on specifics, so some of the following is extrapolation and may not be accurate.

According to the diagram released of the new RDNA 2 Workgroup Processor (WGP), a new hardware unit, the Ray Accelerator, has been added to implement ray/box and ray/triangle intersection in hardware.

That unit can calculate four ray/box intersections or one ray/triangle intersection per clock cycle, and it seems that each WGP has two of them. As part of the ray/triangle intersection test, I imagine, the Ray Accelerator can also return the barycentric coordinates used to retrieve (interpolate) data on the triangle surface.

A new instruction, image_bvh_intersect_ray, has been added to the ISA to expose the Ray Accelerator functionality. According to the ISA documentation this instruction works on both bounding boxes and triangles, depending on the input node type. The instruction is defined as follows:

image_bvh_intersect_ray vgpr_d[4], vgpr_a[11], sgpr_r[4] A16=1

The instruction's inputs and outputs go through a set of vector and scalar registers: vgpr_a[] holds the inputs, vgpr_d[] receives the outputs and sgpr_r[] holds the resource descriptor for the BVH tree stored in memory. The inputs are typical for a BVH traversal algorithm, namely an offset to the BVH node that stores either one or more bounding boxes or a triangle, the maximum ray length (ray_extent, useful to restrict the ray length for short-range AO or point-light shadows, for example), the ray origin and the ray direction (plus its inverse, which I assume is used by the ray/box intersection test):

vgpr_a[0] = BVH node_pointer (uint32)
vgpr_a[1] = ray_extent (float32)
vgpr_a[2-4] = ray_origin.xyz (float32)
vgpr_a[5-7] = ray_dir.xyz (float32)
vgpr_a[8-10] = ray_inv_dir.xyz (float32)
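Since the instruction takes the reciprocal direction precomputed, the shader has to prepare it itself. A minimal C sketch of that preparation — the function names and the tiny-epsilon guard against division by zero are my own choices, not anything from the ISA document:

```c
#include <math.h>

typedef struct { float x, y, z; } float3;

/* Guard against dividing by zero for axis-aligned rays; a very
   large (but finite) reciprocal still works in a slab test. */
static float safe_rcp(float v)
{
    const float eps = 1e-30f;
    return 1.0f / (fabsf(v) > eps ? v : copysignf(eps, v));
}

/* Build the ray_inv_dir.xyz input from the ray direction. */
static float3 ray_inv_dir(float3 dir)
{
    float3 inv = { safe_rcp(dir.x), safe_rcp(dir.y), safe_rcp(dir.z) };
    return inv;
}
```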

The data returned through the 4 output VGPRs (vgpr_d) depend on the type of intersection test. For a ray/box intersection, the 4 registers hold the pointers (offsets) to the 4 BVH child nodes, sorted by distance (referred to as intersection time in the document), along with the hit status. Using those node offsets and the hit status, the shader can then decide which node offset to feed to the next intersection instruction. For a ray/triangle test, the output registers hold the distance to the triangle and the triangle ID. The latter could be used, for example, to index into a buffer with triangle data to retrieve positions, normals, uv coordinates etc.
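To illustrate what a shader might do with those sorted child pointers, here is a hypothetical sketch in C: since the children come back nearest-first, pushing the valid ones onto a traversal stack in reverse order makes the nearest child the next node popped. The struct, names and the invalid-pointer sentinel are assumptions of mine, not the documented encoding.

```c
#include <stdint.h>

#define BVH_INVALID_NODE 0xFFFFFFFFu  /* assumed "no child" sentinel */

typedef struct {
    uint32_t nodes[64];  /* node offsets awaiting traversal */
    int      top;
} traversal_stack;

/* Children arrive sorted nearest-first; push the hits in reverse
   so the nearest intersection ends up on top of the stack. */
static void push_hit_children(traversal_stack *s,
                              const uint32_t sorted_children[4])
{
    for (int i = 3; i >= 0; --i)
        if (sorted_children[i] != BVH_INVALID_NODE)
            s->nodes[s->top++] = sorted_children[i];
}

static uint32_t pop_node(traversal_stack *s)
{
    return s->top > 0 ? s->nodes[--s->top] : BVH_INVALID_NODE;
}
```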

The resource descriptor stored in the sgpr_r[] registers points to the BVH tree stored in memory. It is interesting to note that the descriptor reserves 42 bits for the number of nodes in the BVH tree, which allows for some pretty large trees; I am not sure how large a BVH tree can really get in practice though.

Finally, there is also a mode (A16=1) which compresses the ray direction and inverse direction into float16s instead of float32s to “shorten” the instruction and improve instruction cache utilisation. A variation of the instruction, image_bvh64_intersect_ray, supports 64-bit pointers to BVH nodes for those extra large BVH trees.

It is worth noting that the instruction works with the base pointer of the BVH tree, stored in memory, and offsets to the BVH nodes, which implies that the shader probably has no option to stage the node data in, for example, the Local Data Share (LDS) or registers before kicking off the intersection test.

Understandably, the ISA reference document does not say much about the format of the BVH tree, or its nodes, or how it is created. It can be inferred from the return data of the intersection instruction that it is a BVH4 though, i.e. a BVH tree with 4 child nodes per node and probably one triangle per leaf node.

BVH4 trees are used frequently in raytracing; Intel’s Embree, for example, supports the format for CPU raytracing to improve SIMD utilisation. In my toy engine I am using Embree to build BVH2 trees, as they seem better suited to compute shader based BVH traversal. Hardware support for 4 ray/box intersections per clock makes BVH4 appealing for GPU raytracing as well though.

The basis of all raytracing algorithms is the ray/axis-aligned-bounding-box and ray/triangle intersection tests, and this is where most of the tree traversal time is spent. Accelerating this part of the algorithm, with 4 ray/box intersections or 1 ray/triangle intersection per clock, is expected to give large performance gains. As an example, the ray/box intersection I am using in my compute shader based raytracing demos is around 22 ALU instructions; it will be great to get the equivalent of four times that amount of work in a single clock. It would also be great if this instruction were exposed in HLSL someday, to allow for custom acceleration structures and traversal schemes.


To z-prepass or not to z-prepass

Inspired by an interesting discussion on Twitter about its use in games, I put together some thoughts on the z-prepass and its use in the rendering pipeline.

To begin with, what is a z-prepass (zed-prepass, as we call it in the UK)? In its most basic form it is a rendering pass in which we render the large, opaque meshes (a partial z-prepass) or all the opaque meshes (a full z-prepass) in the scene using a vertex shader only, with no pixel shaders or render targets bound, to populate the depth buffer (aka z-buffer).

Continue reading “To z-prepass or not to z-prepass”

What is shader occupancy and why do we care about it?

I had a good question through Twitter DMs about what occupancy is and why it is important for shader performance, so I am expanding my answer into a quick blog post.

First, some context: while running a shader program, GPUs batch together 32 or 64 pixels or vertices (a batch is called a wavefront on AMD or a warp on NVidia) and execute a single instruction on all of them in one go. Typically, instructions that fetch data from memory have a lot of latency (i.e. the time between issuing the instruction and getting the result back is long), due to having to reach out to caches and maybe RAM. This latency has the potential to stall the GPU while it waits for the data.
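The GPU hides that latency by switching to other resident wavefronts that have work to issue, which is where occupancy comes in. A toy model of the arithmetic (numbers and function name are mine, purely illustrative):

```c
/* If a memory fetch takes `latency` cycles and each wavefront has
   `alu_cycles` of independent ALU work to issue before it needs the
   fetched result, roughly this many resident wavefronts are needed
   to keep the execution units busy: one doing its own work plus
   enough others to cover the remaining stall (rounded up). */
static int wavefronts_to_hide_latency(int latency, int alu_cycles)
{
    return 1 + (latency + alu_cycles - 1) / alu_cycles;
}
```

With an illustrative 300-cycle fetch and 30 cycles of independent ALU work per wavefront, about 11 resident wavefronts are needed; more ALU work per fetch means fewer wavefronts suffice.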

Continue reading “What is shader occupancy and why do we care about it?”

Adding support for two-level acceleration for raytracing

In my (compute shader) raytracing experiments so far I’ve been using a bounding volume hierarchy (BVH) of the whole scene to accelerate ray/box and ray/triangle intersections. This is straightforward and easy to use, and also allows pre-baking the scene BVH to avoid calculating it at load time.

This approach has at least three shortcomings though. First, as the (monolithic) BVH requires knowledge of the whole scene at bake time, it makes it hard to update the scene while the camera moves around, or to add/remove models for gameplay reasons. Second, since the BVH stores bounding boxes/triangles in world space, it makes it hard to raytrace animating models without rebaking the BVH every frame, something very expensive. Last, the monolithic BVH stores every instance of the same model/mesh repeatedly, without being able to reuse it, potentially wasting large amounts of memory.
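The two-level idea addresses that last point by having each instance store only a transform plus a reference to a shared per-mesh BVH, and transforming the ray into the instance's object space before traversing. A hypothetical sketch of the instance record and the ray-origin transform (struct and field names are illustrative, not my engine's actual layout):

```c
typedef struct { float m[3][4]; } transform3x4;  /* rows of a 3x4 affine matrix */
typedef struct { float x, y, z; } point3;

typedef struct {
    transform3x4 world_to_object;  /* inverse of the instance's placement */
    unsigned     blas_index;       /* which shared mesh BVH to traverse */
} bvh_instance;

/* Apply the affine transform to a point (e.g. the ray origin);
   a direction would use the same matrix without the translation. */
static point3 transform_point(const transform3x4 *t, point3 p)
{
    point3 r;
    r.x = t->m[0][0]*p.x + t->m[0][1]*p.y + t->m[0][2]*p.z + t->m[0][3];
    r.y = t->m[1][0]*p.x + t->m[1][1]*p.y + t->m[1][2]*p.z + t->m[1][3];
    r.z = t->m[2][0]*p.x + t->m[2][1]*p.y + t->m[2][2]*p.z + t->m[2][3];
    return r;
}
```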

Continue reading “Adding support for two-level acceleration for raytracing”

Using Embree generated BVH trees for GPU raytracing

Intel released its Embree collection of raytracing kernels, with source, some time ago, and I recently had the opportunity to compare the included BVH generation library against my own implementation in terms of BVH tree quality. The quality of a scene’s BVH is critical for quick traversal during raytracing, and typically a number of techniques, such as the Surface Area Heuristic I am currently using, are applied during tree generation to improve it.

Continue reading “Using Embree generated BVH trees for GPU raytracing”

Open Twitter DMs, a 2 year retrospective

It’s been two years since I opened my Twitter DMs and invited people to ask graphics related questions and seek advice about how to get into the games industry. I think it’s time for a quick retrospective.

The majority of the questions revolve around how to start learning graphics programming. Nowadays there is a large choice of graphics APIs, graphics frameworks, freely available high quality engines and advanced graphics techniques, and the visual bar in modern games is very high. It is understandable that someone trying to learn graphics programming may feel overwhelmed. The many options can also work to one's advantage though; I have written some advice on how to approach learning graphics programming in an older post.

Continue reading “Open Twitter DMs, a 2 year retrospective”

A Survey of Temporal Antialiasing Techniques: presentation notes

At the Eurographics 2020 virtual conference, Lei Yang gave a presentation on the A Survey of Temporal Antialiasing Techniques report, which included a good overview of TAA and temporal upsampling, their issues and future research.

I have taken some notes while watching it and I am sharing them here in case anyone finds them useful.

Continue reading “A Survey of Temporal Antialiasing Techniques: presentation notes”

Optimizing for the RDNA Architecture: presentation notes

AMD recently released a great presentation on RDNA, with a lot of details on the new GPU architecture and optimisation advice.

While watching it I took some notes (like you do at real conferences) and I am sharing them here in case anyone finds them useful. They can serve as a TLDR, but I actively encourage you to watch the presentation as well, as some parts won’t make much sense without it. I have also added some extra notes of my own in square brackets [].

Continue reading “Optimizing for the RDNA Architecture: presentation notes”

GPU architecture resources

I often get asked in DMs about how GPUs work. There is a lot of information on GPU architectures online; one can start with these:

One can then refer to these for a more in-depth study:

Continue reading “GPU architecture resources”

Validating physical light units

Recently I added support for physical light units to my toy engine, based on Frostbite’s and Filament’s great guides. Switching to physical light units allows one to use real-world light intensities (for example in lux and lumens) and camera settings (e.g. aperture, shutter speed and ISO), as well as to mix analytical and captured light sources (HDR environment maps) correctly.

Continue reading “Validating physical light units”