An introduction to workgraphs part 2: Performance

In the previous blog post I described a simple workgraph implementation of a hybrid shadowing system. It was based on a tile classification system with 3 levels (or nodes, in workgraph parlance): one to decide which tiles face away from the Sun and as such need no shadows, one to raymarch the surviving tiles’ pixels towards the Sun looking for collisions in the depth buffer, and a final one to raytrace the remaining pixels to find collisions in the acceleration structure. In this blog post I explore workgraph performance a bit and share some observations.

There aren’t many details yet on how workgraphs work under the hood. At this year’s HPG conference AMD gave a presentation which briefly discussed how a ring buffer is allocated in VRAM for each workgraph node, storing the commands, written by other nodes, that trigger its execution.

This is a presentation worth watching. In summary, this ring buffer scheme is similar to the mechanism the CPU uses to pass commands down to the GPU, the main difference being that in this case the GPU itself (the SIMD units) can write to those ring buffers. The SIMD units output records to the ring buffers of the nodes they are targeting. The Compute unit of the Micro Engine Scheduler then inspects those ring buffers and uses any records it finds to spawn warps on the GPU’s SIMD units. There is some additional logic to avoid deadlocks: the declared maximum number of records each node can output is used to ensure that the ring buffer of a downstream node can accommodate the submitted work. There is no information yet on how this is implemented on NVidia GPUs.
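
Conceptually, that capacity check could look something like the following sketch. To be clear, this logic lives in the scheduler firmware/driver rather than in user code, and all the names here are hypothetical illustrations:

struct NodeRingBuffer
{
    uint read;      // records consumed so far
    uint write;     // records produced so far
    uint capacity;  // total slots in the ring buffer
};

// A producer thread group is only launched if, in the worst case, its
// declared maximum record output still fits in the consumer node's
// ring buffer; otherwise it waits for the consumer to drain records.
bool CanLaunch(NodeRingBuffer consumer, uint declaredMaxRecords)
{
    uint used = consumer.write - consumer.read;
    return (consumer.capacity - used) >= declaredMaxRecords;
}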

Going back to workgraph performance: the hybrid shadowcasting system was good as a workgraph learning exercise, but I didn’t have a reference implementation using a more traditional compute shader based path to compare against, so I decided to convert parts of FidelityFX’s SSSR technique, which I have already integrated into the toy engine, to a workgraph implementation and compare performance. SSSR also implements a classification pass: a compute shader processes the gbuffer deciding which pixels need raymarching based on the material roughness, outputting their coordinates to a large buffer. Not all threads/pixels in a warp will need raymarching, so the classification shader performs an on-the-fly stream compaction to avoid writing invalid threads/pixels to the buffer. Then a second compute shader takes over to do the raymarching and calculate the screen space reflections.
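
That on-the-fly compaction is conceptually similar to the following wave intrinsics sketch (a simplification with hypothetical resource names, not the actual FidelityFX code):

RWStructuredBuffer<uint> g_ray_list;     // densely packed pixel coordinates
RWStructuredBuffer<uint> g_ray_counter;  // element 0 holds the list length

void StoreRay(uint2 screenPos, bool needs_ray)
{
    // How many lanes in this wave need a ray, and where this lane lands among them.
    uint wave_count   = WaveActiveCountBits(needs_ray);
    uint local_offset = WavePrefixCountBits(needs_ray);

    // One lane reserves space in the global list for the whole wave.
    uint wave_base = 0;
    if (WaveIsFirstLane())
        InterlockedAdd(g_ray_counter[0], wave_count, wave_base);
    wave_base = WaveReadLaneFirst(wave_base);

    // Only the lanes that need a ray write, producing a compacted list.
    if (needs_ray)
        g_ray_list[wave_base + local_offset] = (screenPos.y << 16) | screenPos.x;
}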

The implementation of those 2 passes as a workgraph was straightforward, reusing the shader code from the original implementation with some workgraph specific syntax added. All we need are 2 nodes, the first one to classify the pixels:

[Shader("node")]
[NodeLaunch("broadcasting")]
[NodeIsProgramEntry]
[NodeDispatchGrid(1, 1, 1)] // This will be overridden during pipeline creation
[numthreads(8, 8, 1)]
void ClassifyPixels_Node(
    in uint3 globalThreadID : SV_DispatchThreadID,
    in uint2 group_id : SV_GroupID, 
    in uint group_index : SV_GroupIndex,
    [MaxRecords(64)] NodeOutput<ThreadRecord> SSR_Node
)

The classification is still performed using the roughness but, unlike the original implementation, there is no need to compact and write anything to a buffer; we just spawn a node for each thread/pixel that needs raymarching. The node input is the pixel coordinates plus some info on whether we need to copy the raymarched value to the other quad pixels (and to which), packed in 32 bits:

struct ThreadRecord
{
    uint screenPosX : 15;
    uint screenPosY : 14;
    uint copy_horizontal: 1;
    uint copy_vertical: 1;
    uint copy_diagonal: 1;
};
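
Bitfields in HLSL structs are a relatively recent addition (HLSL 2021); for reference, an equivalent manual packing would look like this hypothetical helper (not part of the original code):

// x occupies bits 0-14, y bits 15-28 and the three copy flags
// bits 29-31, matching the bitfield layout above.
uint PackThreadRecord(uint2 screenPos, bool copy_h, bool copy_v, bool copy_d)
{
    return (screenPos.x & 0x7FFF)
         | ((screenPos.y & 0x3FFF) << 15)
         | ((copy_h ? 1u : 0u) << 29)
         | ((copy_v ? 1u : 0u) << 30)
         | ((copy_d ? 1u : 0u) << 31);
}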

The main modification to the original code was to remove the stream compaction and buffer output and replace them with this node spawn logic:

// Every thread must execute this call; threads that don't need a ray
// simply request 0 records.
ThreadNodeOutputRecords<ThreadRecord> threadRecord = SSR_Node.GetThreadNodeOutputRecords(needs_ray ? 1 : 0);

if (needs_ray)
{
    threadRecord.Get().screenPosX = screenPos.x;
    threadRecord.Get().screenPosY = screenPos.y;
    threadRecord.Get().copy_horizontal = copy_horizontal;
    threadRecord.Get().copy_vertical = copy_vertical;
    threadRecord.Get().copy_diagonal = copy_diagonal;
}

// Hand the record over to the consumer node; this must always be called.
threadRecord.OutputComplete();

I kept the parts of the code that record which tiles need denoising for the subsequent denoising pass, for reasons I will explain later.

The second node needed is one to raymarch a particular pixel:

[Shader("node")]
[NodeLaunch("thread")]
void SSR_Node(
    ThreadNodeInputRecord<ThreadRecord> inputData
)

Again, the code is reused straight from the original SSSR, nothing worth calling out here. The node writes the reflections to a rendertarget, ready to be used by the denoising passes.
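
In outline, the node body looks something like the following sketch. The output texture, the RaymarchReflection stand-in for the original SSSR raymarch loop, and the quad copy offsets (assuming the raymarched pixel is the top left of its 2×2 quad) are all hypothetical simplifications:

RWTexture2D<float4> g_intersection_output;

[Shader("node")]
[NodeLaunch("thread")]
void SSR_Node(ThreadNodeInputRecord<ThreadRecord> inputData)
{
    // Unpack the pixel this thread is responsible for.
    ThreadRecord record = inputData.Get();
    uint2 screenPos = uint2(record.screenPosX, record.screenPosY);

    // Raymarch as in the original SSSR implementation (hypothetical stand-in).
    float3 reflection = RaymarchReflection(screenPos);

    g_intersection_output[screenPos] = float4(reflection, 1);

    // Replicate the result to the rest of the 2x2 quad if requested.
    if (record.copy_horizontal)
        g_intersection_output[screenPos + uint2(1, 0)] = float4(reflection, 1);
    if (record.copy_vertical)
        g_intersection_output[screenPos + uint2(0, 1)] = float4(reflection, 1);
    if (record.copy_diagonal)
        g_intersection_output[screenPos + uint2(1, 1)] = float4(reflection, 1);
}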

I mentioned earlier that, for simplicity, I only ported the classification and raymarching passes. The reason for this is that denoising introduces a dependency between neighbouring tiles: take for example a blur filter towards the edge of a tile. This will need access to the pixels of the neighbouring tile, but there is no guarantee that that tile’s pixels will have been processed by the time they are required. With a compute shader based approach this is solved by a barrier between the dispatches (some more info on node synchronisation in workgraphs here).

A quick showcase of SSSR using the workgraph for classification and raymarching; the output is identical to the original implementation:

We now have something to compare the workgraph implementation against. All rendering costs refer to SSSR targeting a 1080p resolution on an RTX 3080 laptop GPU. Checking the original implementation in GPU Trace, and focusing on the classification and raymarching passes, they add up to a total of 0.65ms:

The bottom graph, in orange, is the shader occupancy. We can also see the drain caused by the barriers (blue lines) between the 2 passes, needed to ensure that the UAV buffer has been fully written before raymarching can begin. In this case there is a second barrier because another quick pass runs between classification and raymarching to prepare the indirect dispatch arguments.

Next is the equivalent functionality implemented as a workgraph:

The execution cost is now 2.18ms.

It looks like there is some structure in the occupancy graph, starting with a big block followed by a number of similarly shaped smaller blocks. Nsight Graphics’ GPU Trace (using version 2024.2) recognises workgraphs and can show the utilisation of the various units, similarly to a compute shader, but it doesn’t seem to provide a deeper analysis of why warp launches stall. Overall, the workgraph version appears to have lower shader occupancy.

A quick comparison of the combined top level bottlenecks for the compute shader version

and the workgraph based version

shows much lower SM utilisation and also warp occupancy in the latter.

Performing the capture with the Real-Time Shader Profiler on:

will provide some more information on the workgraph:

Here we can see that the Classification node has a theoretical occupancy of 32 and the SSSR (raymarching) node an occupancy of 16. In the compute shader version, both the Classification and Raymarching shaders have a theoretical occupancy of 32.

The latest NSight Graphics introduces a nice new feature: by right clicking on a shader in the Shader Pipelines tab above, we can see where in the timeline it is executing. Workgraphs are supported as well; for example, this is the execution for the classification node, zoomed in a bit to see more detail:

and this is the execution for the raymarch node:

We notice some overlap in the execution; it appears that, in the repeated pattern of the shader occupancy graph, the first peak belongs to the classification node and the long drain to the raymarching node.

To understand the performance profile, let’s take a step back and simplify the problem a bit, focusing on the classification node alone and modifying it to not spawn any raymarching nodes.

Although not totally equivalent in the work they do, the workgraph with only classification nodes costs 0.49ms, compared to 0.15ms for the compute shader based classification pass.

According to GPU Trace, the theoretical max occupancy of both is the same, at 32 warps per SIMD, but the actual average occupancy is 25 (52%) for the compute shader version and only 13 (27%) for the workgraph version. One interesting thing I noticed comparing the GPU Traces of the two passes: while for the compute shader based ClassifyTiles the number of threadgroups launched (32,400), warps launched (64,800) and threads launched (2,073,600) are as expected for an 8×8 threadgroup and a 1920×1080 image, the corresponding numbers for the workgraph based one are quite different: 40,516 threadgroups, 72,937 warps and 2,082,822 threads. These numbers make little sense; for example, even if the threadgroup count were correct, you would expect 40,516 (8×8) threadgroups to launch 81,032 warps. I did a quick experiment spawning nodes with an 8×4 threadgroup size (exactly a warp) and, although the number of threads launched was the same as in the 8×8 case, the number of threadgroups launched was 72,916 and the number of warps launched 72,937. These numbers are all far from the expected ones, but this time the number of (warp-sized) threadgroups and the number of warps are much closer. I am not sure if this is a bug in NSight or in the driver, or whether there is something particular in the way workgraphs launch work.

Another interesting thing I noticed is that spawning the ClassifyTiles node without it doing any work (the shader has an empty body) does not really reduce its execution time. The compute shader version of the ClassifyTiles pass with an empty shader costs ~0.03ms. The workgraph based version costs 0.38ms, only around 0.1ms less than the version that actually does work (but doesn’t spawn any nodes). Comparing the instructions executed by the non-empty ClassifyTiles workgraph node:

with that executed by the empty node

the empty node seems to spend most of its time in shared memory instructions, synchronisation instructions and integer maths, which may have to do with record indexing. This suggests that there is quite a bit of overhead in spawning workgraph nodes and that most of the original ClassifyTiles node cost is due to it.

Going back to the full workgraph SSSR implementation, one thing that stands out is the Sync Q Waiting metric in GPU Trace, something that doesn’t appear for the compute shader version:

This signifies that the Front End, which receives instructions and dispatches them to either the graphics or compute pipes, is stalling. There isn’t enough information to determine whether this is related to the way nodes are launched and executed, but there might be a correlation.

So far I have launched the raymarching nodes using the “thread” launch mode, which sounds like a natural fit for the way the Classification node determines per pixel whether raymarching is needed. With this mode the GPU will try to batch threads into warps, but there is no guarantee where (in which warp) they will end up. An alternative is the “coalescing” launch mode, in which we declare a thread group size that the GPU will attempt (but is not guaranteed) to fill with up to the maximum number of records specified:

[Shader("node")]
[NodeLaunch("coalescing")]
[NumThreads(64, 1, 1)]
void SSR_Node(
    [MaxRecords(64)] GroupNodeInputRecords<ThreadRecord> inputData,
    uint threadIndex : SV_GroupIndex
)
{
}
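
The body then iterates over however many records actually arrived; a minimal sketch, assuming the same hypothetical output handling as in the thread launch version:

// Each thread group receives up to MaxRecords records; Count() returns
// how many were actually coalesced, which may be fewer than 64.
if (threadIndex < inputData.Count())
{
    ThreadRecord record = inputData[threadIndex];
    uint2 screenPos = uint2(record.screenPosX, record.screenPosY);
    // ...raymarch and write out exactly as in the thread launch version...
}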

Compared to the top bottlenecks of the “thread” launch SSSR node above, the SM throughput is a few percent higher in this case and the occupancy has increased to 23.3%, up from 18.4%.

Also, the overall workgraph dispatch cost went down from 2.18ms to 1.81ms, a noticeable improvement. Although the advice is to use coalescing only if you need thread group shared memory, in this case it appears to help a bit with cache efficiency (15.6% L1TEX throughput compared to 13% with thread launch), which contributes to the cost reduction.

I ended this investigation with more questions about workgraphs’ performance than I answered. For the usecase I profiled, and possibly for typical classification based techniques in general, workgraphs appear to be much slower at the moment. It is early days though, and I am sure performance will improve; this is a very exciting technology and I am looking forward to seeing how it evolves.

Some knowledge of how workgraphs are implemented internally by each IHV would be helpful, and deeper profiling and debugging information will be necessary for efficient workgraph programming. To add to the wish list, it would be great if NSight Graphics also showed the produced SASS for the shaders, to get a better idea of what happens under the hood.
