# Notes on screenspace reflections with FidelityFX SSSR

Today I set out to replace the old SSR implementation in the toy engine with AMD’s FidelityFX SSSR, but in the end I got distracted and spent the day studying how it works instead. This is a modern SSR solution that implements a lot of good practices, so I’ve gathered my notes in a blog post in case someone finds them of interest. This is not intended as an exhaustive description of the technique, more like a few interesting observations.

The technique takes as input the main rendertarget, the worldspace normal buffer, a roughness buffer, a hierarchical depth buffer and an environment cubemap. The hierarchical depth buffer is a mip chain where each pixel of a mip level is the minimum of the corresponding 2×2 area of depths in the previous level (mip 0 corresponds to the screen-sized, original depth buffer). It will be used later to speed up raymarching, but can also be used in many other techniques, like GPU occlusion culling.
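The min-reduction itself is simple to express. Here is a CPU-side NumPy sketch of building such a chain (assuming power-of-two dimensions for brevity; `build_hiz` is my own name, not from the sample):

```python
import numpy as np

def build_hiz(depth):
    """Build a min-reduction depth mip chain; mip 0 is the full-res buffer."""
    mips = [depth]
    while min(mips[-1].shape) > 1:
        prev = mips[-1]
        # Each texel takes the minimum of the corresponding 2x2 area below it.
        mips.append(np.minimum.reduce([prev[0::2, 0::2], prev[1::2, 0::2],
                                       prev[0::2, 1::2], prev[1::2, 1::2]]))
    return mips
```

A ray can then conservatively skip a whole cell at mip N if it stays closer than that cell's minimum depth, which is what the intersection pass exploits later.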

A straightforward but inefficient way to create this depth mip chain is to calculate each mip level using a compute shader dispatch, using the previous mip level as an input. This will cause the GPU to drain the pipeline between each dispatch to ensure that a mip level has been fully written to before it proceeds to the next one. I don’t have such an implementation in the toy engine to demonstrate how this looks so I will borrow the trace from AMD’s Single Pass Downsampler presentation that showcases the effect well.

In the end, the stalls between each dispatch become longer than the actual dispatch duration, wasting a lot of GPU time. A more modern alternative is to calculate all mip levels in a single dispatch, using groupshared memory to exchange data between mip levels, removing the need for barriers and pipeline drains.

This pass costs 0.12ms on an RTX 3080 mobile at 1080p.

The actual SSR work starts with a classification pass (ClassifyTiles, 0.17ms), which has a dual purpose. The first is to add all pixels that need rays traced to a global buffer, along with their screen-space coordinates. Also, in case coarse tracing has been selected (fewer than 4 samples per pixel quad), it marks whether each pixel should be copied to its neighbours in the pixel quad or not. The decision of whether a pixel needs a ray is based on roughness: very rough surfaces don’t get any, relying on the prefiltered environment map as an approximation instead. The other purpose of this pass is to mark tiles that contain “reflective” pixels (i.e. pixels that need tracing). This later speeds up denoising by allowing us to process only tiles that actually need it.

Worth noting in this pass is that it will sample the environment map and write the radiance for pixels that don’t need rays (because of high roughness) into a separate rendertarget. The environment map is pre-blurred and the roughness is used to select the appropriate cubemap mip level. The reason this is done here is that we won’t be touching pixels that don’t require tracing again, although, if in a pinch for memory, it could be done when adding the SSR result to the main rendertarget later. The other thing worth noting is that the pass extracts and writes the roughness to a separate, single-channel rendertarget to improve memory bandwidth in subsequent passes. Finally, the shader remaps the linear group thread index to a Morton order one (FFX_DNSR_Reflections_RemapLane8x8) to improve locality of texture accesses. This can speed up memory reads by improving cache coherence.
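To illustrate the idea, here is a plain Morton (Z-order) remap of a linear lane index within an 8×8 tile, written in Python. Note that the actual FidelityFX helper uses its own swizzle pattern, so this is an illustrative analogue rather than a port:

```python
def remap_lane_morton(i):
    """Map a linear lane index (0..63) to 2D coords within an 8x8 tile
    in Z (Morton) order, by de-interleaving the index bits."""
    x = y = 0
    for b in range(3):  # 8x8 tile -> 3 bits per axis
        x |= ((i >> (2 * b)) & 1) << b      # even bits -> x
        y |= ((i >> (2 * b + 1)) & 1) << b  # odd bits  -> y
    return x, y
```

Consecutive lanes now touch 2×2 clusters of pixels instead of an 8-wide row, so the texels a wave reads sit closer together in memory.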

The next step is to prepare a 128×128 texture with screen-space (animated) blue noise (PrepareBlueNoiseTexture, <0.01ms), based on the work of Eric Heitz. This will be used later to drive the stochastic sampling of the specular lobe.

Once this is done we are (almost) ready to raymarch; the only problem is that we don’t know, on the CPU, the size of the global array of pixels to trace in order to launch a Dispatch. For that reason, the technique fills a buffer with indirect arguments, using data already known to the GPU, and uses an ExecuteIndirect instead. The indirect arguments buffer is populated during the PrepareIndirectArgs pass, at <0.01ms. Nothing particular to mention here apart from that it adds 2 entries to the indirect buffer, one for the pixels to trace and one for the tiles to denoise later.

The real work starts in the next pass (Intersect, 0.88ms) where ray marching for the pixels that need it finally happens, using an ExecuteIndirect with the arguments computed earlier.

There are a couple of things worth discussing here. To reduce the raymarching cost, typically a maximum of one ray per pixel is traced, drawn from the GGX distribution (image from)

This can lead to rays that are below the horizon (as showcased above) or masked or shadowed by the BRDF’s geometric term. To improve the quality of the raymarched rays the sample draws from the distribution of visible normals only, which is effectively the set of potentially unoccluded, valid rays.
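For reference, a NumPy sketch of sampling the GGX visible normal distribution, following the Heitz 2018 formulation (tangent space, z up; the function name and interface here are my own):

```python
import numpy as np

def sample_ggx_vndf(v, alpha, u1, u2):
    """Sample a microfacet normal from the GGX distribution of visible
    normals. v: view direction in tangent space (z = surface normal),
    alpha: GGX roughness, u1/u2: uniform randoms in [0, 1)."""
    # Stretch the view vector to the hemisphere configuration.
    vh = np.array([alpha * v[0], alpha * v[1], v[2]])
    vh /= np.linalg.norm(vh)
    # Build an orthonormal basis around vh.
    lensq = vh[0] ** 2 + vh[1] ** 2
    t1 = (np.array([-vh[1], vh[0], 0.0]) / np.sqrt(lensq)
          if lensq > 0.0 else np.array([1.0, 0.0, 0.0]))
    t2 = np.cross(vh, t1)
    # Sample a disk, warped proportionally to the projected visible area.
    r = np.sqrt(u1)
    phi = 2.0 * np.pi * u2
    p1 = r * np.cos(phi)
    p2 = r * np.sin(phi)
    s = 0.5 * (1.0 + vh[2])
    p2 = (1.0 - s) * np.sqrt(1.0 - p1 ** 2) + s * p2
    # Project back onto the hemisphere and unstretch.
    nh = p1 * t1 + p2 * t2 + np.sqrt(max(0.0, 1.0 - p1 ** 2 - p2 ** 2)) * vh
    ne = np.array([alpha * nh[0], alpha * nh[1], max(0.0, nh[2])])
    return ne / np.linalg.norm(ne)
```

The reflection ray is then the view direction mirrored about the sampled normal; because only visible normals are sampled, below-horizon and self-occluded directions are avoided by construction.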

Next, to accelerate raymarching, the hierarchical z-buffer produced earlier is used (images from). Raymarching starts at mip 0 (highest resolution) of the depth buffer.

If no collision is detected, we drop to a lower resolution mip and continue raymarching. Again, if no collision is detected we continue dropping to lower resolution mips until we detect one.

If we do, we climb back up to a higher resolution mip and continue from there. This allows quickly skipping large parts of empty space in the depth buffer.

The raymarching loop in the shader is pretty straightforward, you can see the current_mip increasing or decreasing based on the collision result, but it is worth calling out an interesting trick.
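The mip up/down logic is easier to see in one dimension. Below is a simplified 1D analogue of the traversal in Python (a toy model, not the sample’s shader code): the ray’s depth grows linearly along x, and each cell of a min-mip chain is skipped wholesale whenever the ray stays closer than the cell’s minimum depth.

```python
import numpy as np

def build_min_mips(depth):
    """1D min-reduction mip chain (power-of-two length assumed)."""
    mips = [depth]
    while len(mips[-1]) > 1:
        prev = mips[-1]
        mips.append(np.minimum(prev[0::2], prev[1::2]))
    return mips

def trace_1d(mips, d0, slope):
    """March a ray with depth d(x) = d0 + slope * x across the heightfield.
    Returns the hit cell index at mip 0, or None on a miss."""
    n = len(mips[0])
    max_mip = len(mips) - 1
    x, mip = 0.0, 0
    while 0.0 <= x < n:
        cell = int(x) >> mip
        exit_x = float((cell + 1) << mip)
        if d0 + slope * exit_x < mips[mip][cell]:
            x = exit_x                   # whole cell is empty: skip it
            mip = min(mip + 1, max_mip)  # and try a coarser level next
        elif mip > 0:
            mip -= 1                     # potential hit: refine
        else:
            return int(x)                # reached the finest level: hit
    return None
```

On a flat floor the ray climbs to the coarsest mip and crosses the screen in a handful of iterations, which is exactly the speedup the hierarchical depth buffer buys.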

In many cases raymarching is a highly divergent operation meaning that some rays will take many more steps than others to find a hit (SSAO, glossy reflections etc). This means that a single long ray can stall the whole wave from retiring, keeping a lot of threads inactive. To improve this, the sample stops raymarching (exit_due_to_low_occupancy) when the number of active threads (counting them with WaveActiveCountBits(true)) has fallen below a threshold. This will prevent a small number of long rays holding back the whole wave.

Finally, the technique “validates” the result (FFX_SSSR_ValidateHit) to make sure that it is not out of the screen and that we didn’t hit a backface. Also, a depth buffer “thickness” is used, which is a content-specific quality improvement that stops very thin geometry (like grass or hair) occluding more than it should (eg a single blade of grass is a massive canyon in the depth buffer, stopping all rays that would go under it). I would imagine parts of the validation step (eg the thickness test) could be integrated into the raymarching loop above to continue marching even if the ray misses a surface due to it being too thin (i.e. to allow the ray to go “beneath” a surface). The validation step produces a “confidence” value which is used to interpolate between a sample from the main rendertarget (a valid screen space reflection) and the environment map fallback. As a final step the shader copies the calculated radiance to the neighbouring pixels, if rendering SSR at lower resolution.

This is the result of the intersection pass; we can notice that mirror-like reflections (low roughness) are relatively noise free, unlike areas with high roughness. Areas like the top of the pawns, where no screenspace information is available to raymarch, fall back to the environment map.

Any stochastic technique will introduce noise due to low sample count, so the sample wraps up with some denoising passes, typically both spatial and temporal. The first pass, (DNSR Reproject, 0.66ms) will prepare the history buffer used in the temporal resolve later by reprojecting the SSR result from the previous frame onto the current one. This works on the list of tiles produced earlier during the ClassifyTiles pass, using an ExecuteIndirect.

Typical reprojection, as used in TAA for example, is not suitable for reflections, as the objects in the reflections move based on their own depth and not the depth of the surfaces they reflect on (captured in the depth buffer). That is, we need to determine where the reflected objects were in the previous frame. To get the reflected point uv position, FFX_DNSR_Reflections_GetHitPositionReprojection() reconstructs the hit point 3d position using the surface point and the surface-relative hit point distance, stored during the raymarching pass earlier. The technique uses a lot of information from the previous frame (radiance, roughness, depth and normals) to determine sample similarity, supporting paths for various cases (like mirror or glossy surfaces), and determines disocclusion using both normals and depths from the current and previous frames. The outcome of the reprojection pass is a “reprojected” radiance and a disocclusion factor. This factor is used to determine if the reprojected sample is good enough to keep or if it belongs to a newly exposed surface that we have no information for in the previous frame (in which case 0 is written out).
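The core of the hit point reprojection could be sketched like this in NumPy (the function name and exact conventions, e.g. the D3D-style y flip, are assumptions for illustration, not the sample’s code):

```python
import numpy as np

def reproject_hit_uv(surface_pos_ws, ray_dir_ws, hit_distance, prev_view_proj):
    """Reconstruct the reflected hit point in world space and project it
    with the previous frame's view-projection matrix to get its old UV."""
    hit_ws = surface_pos_ws + ray_dir_ws * hit_distance
    clip = prev_view_proj @ np.append(hit_ws, 1.0)
    ndc = clip[:3] / clip[3]
    # NDC [-1,1] -> UV [0,1], flipping y (D3D-style convention assumed)
    return np.array([ndc[0] * 0.5 + 0.5, 0.5 - ndc[1] * 0.5])
```

The returned UV is then used to fetch the previous frame’s reflection radiance, instead of the surface’s own motion-vector-reprojected UV.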

The variance of the pixel reflection luminance (between new and previous frame radiance) is also stored. One thing to note is that the code uses a technique described here, counting the number of frames (num_samples) since disocclusion of the pixel to stabilise the variance (variance_mix) after a disocclusion event (i.e. it will interpolate towards the old variance initially, ending up towards the new variance as the number of frames since the disocclusion increases). This variance will later be used to guide the spatial filtering of the SSR image. Finally, this pass calculates the average radiance per 8×8 tile, using groupshared memory, and stores it to a rendertarget. Since all intermediate values are stored as halves, the shader does some extra work to pack the halves into uints before storing them to groupshared memory, using the f32tof16() function, which returns the bit representation of the half precision floating point number in the bottom 16 bits of a uint. As mentioned, the output of this pass will act as the history buffer for the Temporal Resolve pass later.
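The packing trick is easy to demonstrate on the CPU with NumPy’s bit-level views (these helper names are my own; in HLSL the equivalent would be f32tof16 plus a shift and or):

```python
import numpy as np

def pack_half2(a, b):
    """Pack two floats into one 32-bit word as two half-precision values,
    mirroring what f32tof16() + a 16-bit shift achieves in HLSL."""
    lo = int(np.float16(a).view(np.uint16))
    hi = int(np.float16(b).view(np.uint16))
    return lo | (hi << 16)

def unpack_half2(word):
    """Recover the two half-precision values from a packed 32-bit word."""
    lo = np.uint16(word & 0xFFFF).view(np.float16)
    hi = np.uint16(word >> 16).view(np.float16)
    return float(lo), float(hi)
```

This halves the groupshared memory footprint per channel at the cost of a few extra ALU instructions, usually a good trade on GPUs.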

This is the result of the reprojection pass, we can notice the tile-based processing and also that the sample does not clear the rendertarget (so you can see results from previous frame as the camera moves around) because it doesn’t need to — we know exactly which tiles contain valid data for this frame.

After reprojection, a spatial filtering pass follows (DNSR Prefilter, 0.34ms) that processes the result of the Intersect pass to reduce noise. This works with the tiles marked as needing denoising during the ClassifyTiles pass as well. Filtering is performed using 15 samples (+ the central one) drawn from the Halton sequence and converted to integer offsets:
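Generating such offsets is straightforward; a Python sketch using the standard radical-inverse construction (the mapping of the [0,1) samples to an 8×8 footprint here is an assumption for illustration, the sample’s exact kernel may differ):

```python
def halton(index, base):
    """Radical inverse of `index` in the given base, in [0, 1)."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

# 15 low-discrepancy integer offsets, bases 2 and 3, mapped to [-4, 4)
offsets = [(int(halton(i, 2) * 8) - 4, int(halton(i, 3) * 8) - 4)
           for i in range(1, 16)]
```

Low-discrepancy points cover the neighbourhood more evenly than pure random offsets, so fewer taps are needed for the same filter quality.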

The filtering itself uses normal, depth, radiance and variance differences to adjust the weight of each sample, to avoid overblurring.

This is the result of the spatial filter pass, we notice that blurring is stronger in noisy areas, which have higher variance.

The final pass is a temporal resolve to converge to the final image (DNSR Resolve Temporal, 0.53ms). This uses the reprojected image from the Reprojection pass (as the history buffer) and the spatially filtered image from the Prefilter pass (as the current buffer). It clips the history sample using local area statistics like variance and mean. Blending the current and history samples doesn’t use a constant factor, as is usual in temporal antialiasing, but uses the trick mentioned already to bias the factor based on the number of frames since disocclusion:

This results in starting with the current sample (new_signal) right after disocclusion, gradually blending towards the history (clipped_old_signal) as the number of frames since disocclusion increases.
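One common formulation of this kind of accumulation-speed weighting looks like the following (a sketch of the idea, not the sample’s exact weighting; `max_history` caps the effective history length):

```python
def resolve_temporal(new_signal, clipped_old_signal,
                     frames_since_disocclusion, max_history=32.0):
    """Blend current sample towards (clipped) history, with a history weight
    that grows with the number of frames accumulated since disocclusion."""
    n = min(frames_since_disocclusion, max_history)
    history_weight = n / (n + 1.0)  # 0 right after disocclusion, -> ~1 later
    return (history_weight * clipped_old_signal
            + (1.0 - history_weight) * new_signal)
```

With n = 0 the output is the current sample alone; once the history is long, each new frame contributes only a small fraction, which is what suppresses the residual noise.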

This is the result of the temporal resolve pass, most of the noise has been suppressed.

A few more observations: the code uses 16-bit floats whenever possible to reduce the number of VGPRs required for the ALU operations. This can increase shader occupancy, which helps hide the latency of memory reads. The sample allocates quite a few internal rendertargets (I counted 13) to store the history of almost all provided inputs and various other data, prioritising speed over memory.

This technique achieves good SSR quality and is educational to study, with lots of good practices in place. Kudos to AMD for releasing the code to the public.

Now, I should go back and actually do some work to integrate this into the toy engine!

# Order Independent Transparency: Endgame

In the past 2 posts (part 1, part 2), I discussed the complexity of correctly sorting and rendering transparent surfaces and I went through a few OIT options, including per pixel linked lists, transmission function approximations and the role rasteriser order views can play in all this. In this last post I will continue and wrap up my OIT exploration discussing a couple more transmittance function approximations that can be used to implement improved transparency rendering.

Continue reading “Order Independent Transparency: Endgame”

# Order independent transparency, part 2

In the previous blog post we discussed how to use a per-pixel linked list (PPLL) to implement order independent transparency and how the unbounded nature of overlapping transparent surfaces can be problematic in terms of memory requirements, and ultimately may lead to rendering artifacts. In this blog post we explore approximations that are bounded in terms of memory.

Also in the previous blog post we discussed the transmittance function

${T(z_i) = \prod_{k=0}^{i}{(1-a_k)}}$

and how it can be used to describe how radiance is reduced as it travels through transparent surfaces

${\sum_{i=0}^{N-1}{c_i a_i T(z_{i-1})} + T(z_{N-1}) R}$

Continue reading “Order independent transparency, part 2”

# Order independent transparency, part 1

Correctly sorting transparent meshes is one of the hard problems in realtime rendering. The typical solution is to sort the meshes by distance and render them back to front. This can’t address all transparency sorting artifacts, for example when meshes intersect, or self-sorting artifacts within a single transparent mesh. Also, correctly sorting particles against transparent meshes can sometimes be a challenge.

For an extreme but illustrative example, here is a screenshot of Sponza rendered with transparent materials using hardware alpha blending with the over operator c1*a1 + c2*(1-a1). To mix things up I have added a particle system using additive blending (the yellow one).

Continue reading “Order independent transparency, part 1”

# Accelerating raytracing using software VRS

I discussed in the previous post how divergence in a wave can slow down the execution of a shader. This is particularly evident during raytracing global illumination (GI) as ray directions between neighbouring wave threads can differ a lot forcing different paths through the BVH tree with different number of steps. I described how ray binning can be used to improve this but it is not the only technique we can use. For this one we will use a different approach, instead of “binning” based on the similarity of input rays we will “bin” threads based on the raytraced GI’s output. This makes sense because it is usually quite uniform, with large and sudden transitions happening mainly at geometric edges.

Continue reading “Accelerating raytracing using software VRS”

# Increasing wave coherence with ray binning

Raytracing involves traversing acceleration structures (BVH), which encode a scene’s geometry, in an attempt to identify ray/triangle collisions. Depending on the rendering technique, eg raytraced shadows, AO, GI, rays can diverge a lot in direction. This introduces additional cache and memory pressure as rays in a wave can follow very different paths in the BVH, ultimately colliding with different triangles.

Yet, ray generation is typically based on a limited set of random samples (eg a tiled blue noise texture), which we reuse across the frame, meaning that we raytrace using a limited number of ray directions. It sounds reasonable that we should be able to group rays by direction so as to enable all in a group to follow a similar path within the BVH tree and potentially hit the same triangle. Of course grouping by ray direction only is not enough, the origin of the ray matters as well, ideally we would like to group rays by both attributes.

Continue reading “Increasing wave coherence with ray binning”

# Raytracing, a 4 year retrospective

Recently I got access to a GPU that supports accelerated raytracing and the temptation to tinker with DXR is too strong. This means that I will steer away from compute shader raytracing for the foreseeable future. It is a good opportunity though to do a quick retrospective of the past few years of experimenting with “software” raytracing.

Continue reading “Raytracing, a 4 year retrospective”

# Raytraced global illumination denoising

Recently, I’ve been playing Metro: Exodus on Series X a second time, after the enhanced edition was released, just to study the new raytraced GI the developers added to the game (by the way, the game is great and worth playing anyway). What makes this a bigger achievement is that the game runs at 60fps as well. The developers, smartly, use a layered approach to calculating the GI in the game, starting with screen space raymarching against the g-buffer for collisions and then resorting to tracing rays at 0.25 rays per pixel (aka raytracing at half the rendering resolution) when none is found. They also use DDGI to calculate a second bounce, to light the hitpoints with indirect lighting as well, and all of these working together give an overall great lighting result. While all this is very interesting, it is their approach to denoising that piqued my interest and I set about to explore it a bit more in my toy renderer. This technique is described in this presentation and expanded in this one, from which I will also be borrowing some images.

Continue reading “Raytraced global illumination denoising”

# Abstracting the Graphics API for a toy renderer

I’ve been asked a few times in DMs what the best way is to abstract the graphics API in one’s own graphics engine to make development of graphics techniques easier. Since I’ve recently finished a first-pass abstraction of DirectX12 in my own toy engine, I’ve decided to put together a post to briefly discuss how I went about doing this.

Modern, low level APIs like DX12 and Vulkan are quite verbose, offering a lot of control to the developer but also requiring a lot of boilerplate code to set up the rendering pipeline. This prospect can seem daunting to people who want to use such an API, and they often reach out to ask what the best way is to abstract it in their own graphics engines.

Continue reading “Abstracting the Graphics API for a toy renderer”

# Shader tips and tricks

In this post I have collected some random shader tips and tricks for easy access. Most of them revolve around performance improvement, as one can imagine, with a slight bias towards GCN/RDNA and DirectX. Before I begin with the list of advice, some caveats first.