Accelerating raytracing using software VRS

I discussed in the previous post how divergence in a wave can slow down the execution of a shader. This is particularly evident when raytracing global illumination (GI), as ray directions between neighbouring wave threads can differ a lot, forcing different paths through the BVH tree with different numbers of steps. I described how ray binning can be used to improve this, but it is not the only technique we can use. For this one we will take a different approach: instead of “binning” based on the similarity of the input rays, we will “bin” threads based on the raytraced GI’s output. This makes sense because the output is usually quite uniform, with large and sudden transitions happening mainly at geometric edges.

We can take advantage of this by varying the number of rays we trace based on how uniform the GI output is, tracing fewer rays in uniform areas and more in areas of large variation (like edges). Varying the shader invocation frequency per pixel is called Variable Rate Shading (VRS), a feature introduced in recent GPUs and exposed by graphics APIs like DX12 and Vulkan. Since hardware VRS only works for pixel shaders, we will emulate the feature with “software” VRS in the compute shader. There have been a few good talks on the topic recently; this particular investigation is inspired by The Coalition’s software VRS implementation for Gears 5’s screen space global illumination.

We talked about how diffuse GI appears to exhibit transitions at “edges”, and identifying them is indeed the first step. This is straightforward: we can run a Sobel filter over the previous frame’s raytraced GI output to detect edges based on a threshold, split the image into tiles and determine a per-tile shading rate. I used DX12’s shading rate values for that purpose.

#define SHADING_RATE_1X1 0  // run shader per pixel 
#define SHADING_RATE_1X2 1  // run shader once per 1x2 pixel area 
#define SHADING_RATE_2X1 4  // run shader once per 2x1 pixel area
#define SHADING_RATE_2X2 5  // run shader once per 2x2 pixel area

A shading rate of 1×2 over the whole image, for example, effectively means that we will shade half the pixels in the vertical direction, while a rate of 2×1 means half the pixels in the horizontal direction. To determine the rate I apply the Sobel filter and calculate the two gradients in the horizontal (gx) and vertical (gy) directions.

	const float3 luminanceCoeff = float3(0.2126, 0.7152, 0.0722);
	float sobel[8];

	for (int i = 0; i < 8; i++) 
	{
		float3 gi = inputRT.Load(int3(DTid.xy + texAddrOffsets[i], 0)).rgb;
		sobel[i] = dot(gi, luminanceCoeff);
	}

	float gx = abs(sobel[0] + 2 * sobel[3] + sobel[5] - sobel[2] - 2 * sobel[4] - sobel[7]);
	float gy = abs(sobel[0] + 2 * sobel[1] + sobel[2] - sobel[5] - 2 * sobel[6] - sobel[7]);

	const float threshold = EdgeDetectThreshold;

	uint rate = SHADING_RATE_2X2;

	if (gx > threshold && gy > threshold)
	{
		rate = SHADING_RATE_1X1;
	}
	else if ( gx > threshold)
	{
		rate = SHADING_RATE_2X1;
	}
	else if (gy > threshold)
	{
		rate = SHADING_RATE_1X2;
	}
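
The texAddrOffsets array is not shown above; a minimal sketch of a layout consistent with the gradient expressions (the 3×3 neighbourhood in row-major order, centre pixel excluded) could look like this:

	// Assumed layout of texAddrOffsets, matching the gx/gy expressions above.
	static const int2 texAddrOffsets[8] =
	{
		int2(-1, -1), int2(0, -1), int2(1, -1),	// top row
		int2(-1,  0),              int2(1,  0),	// left and right of the centre
		int2(-1,  1), int2(0,  1), int2(1,  1)	// bottom row
	};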

We decide the rate based on whether the two gradients are above the EdgeDetectThreshold or not. This shading rate is per pixel, but we need to determine a single value for the whole tile (I am using an 8×8 tile in this instance). For that, we can use some groupshared memory and have the first thread in the group write the value out.

	// ShadingRate is assumed to be a groupshared uint, initialised to SHADING_RATE_2X2
	// (with a group sync) earlier in the shader, before the atomic min below.
	InterlockedMin(ShadingRate, rate);

	GroupMemoryBarrierWithGroupSync();

	if (GroupIndex == 0)
	{
		outputRT[Gid.xy] = ShadingRate;
	}

We conservatively select the minimum shading rate value as representative of the whole tile; since lower values correspond to finer rates (1×1 is 0), this ensures that we correctly capture the detail and variation present in the tile.

Now, the question is what we should use as the input image to this pass. A common option is to feed in the final render target, post exposure and tonemapping. This makes sense in the general case: the image might have a depth of field or fog effect applied to it which will make parts of the screen blurred and low res, shadowed areas might need less detailed shading, and similarly for uniform materials (eg same albedo, normals). While the reduced detail in the DOF-blurred areas may be desirable for indirect diffuse detail reduction as well, the detail in, and contribution of, the GI is typically more noticeable in shadowed areas, and material detail (eg albedo), which doesn’t really affect diffuse GI, may influence the shading rate calculation unnecessarily. So there are a few things to consider and YMMV; in my case I used the denoised raytraced GI rendertarget, like the one above, as the input, with a threshold of 0.1 and a tile size of 8×8.

Red means a shading rate of 1×1, yellow 2×1, blue 1×2 and green 2×2. As expected, most of the tiles end up with a 2×2 shading rate, and it is mostly the tiles that cover edges that need finer shading rates (the naming is a bit unintuitive: higher numbers actually mean coarser shading rates, as we invoke the shader fewer times, i.e. at a lower frequency, per pixel).
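
For reference, a minimal sketch of how such a debug view could be produced from the shading rate image (the function name and exact colours are just illustrative):

	// Hypothetical debug helper: map a tile's shading rate to a colour for visualisation.
	float3 DebugShadingRateColour(uint rate)
	{
		if (rate == SHADING_RATE_1X1)
			return float3(1, 0, 0);	// red: shade every pixel
		if (rate == SHADING_RATE_2X1)
			return float3(1, 1, 0);	// yellow: half rate horizontally
		if (rate == SHADING_RATE_1X2)
			return float3(0, 0, 1);	// blue: half rate vertically
		return float3(0, 1, 0);		// green: 2x2, quarter rate
	}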

Now we can use this shading rate image to determine how to spawn rays during raytracing. A straightforward approach is to consider 2×2 pixel quads in the compute shader and kill the unneeded threads (per thread rejection) based on the shading rate for this tile.

	uint2 tileDims = uint2(SHADING_RATE_TILE_WIDTH, SHADING_RATE_TILE_HEIGHT);
	uint rate = shadingRate[screenPos.xy / tileDims].r;

	// Check if we need to reject this thread. The 2x2 rate is the bitwise OR of 1x2 and 2x1
	if ((rate & SHADING_RATE_2X1) && (screenPos.x & 1))
	{
		return;
	}

	if ((rate & SHADING_RATE_1X2) && (screenPos.y & 1))
	{
		return;
	}

	// trace ray for this thread

	outputRT[screenPos.xy] = result; 

	// copy result to the other pixels according to the shading rate
	if ( rate & SHADING_RATE_1X2 )
		outputRT[screenPos.xy + uint2(0, 1)] = result;

	if (rate & SHADING_RATE_2X1)
		outputRT[screenPos.xy + uint2(1, 0)] = result;

	if (rate == SHADING_RATE_2X2)
		outputRT[screenPos.xy + uint2(1, 1)] = result;

In terms of visual quality, this is the original RTGI in the final output

and this is with thread rejection

Very close visually. In this context, with this content and shading rate image, per thread rejection speeds up raytracing by about 32%.

In most other contexts, killing threads and letting the wave run half empty would be inadvisable. For RTGI though, and for any technique that can introduce large thread divergence, any opportunity to reduce unpredictable thread execution cost and allow a wave to retire early is welcomed by the GPU.

Even though the thread rejection approach performs well, the GPU still has to run waves that are potentially mostly empty, which is not great. Ideally we would like to be able to reuse that resource for other work. For that we will follow The Coalition’s approach for wave rejection in Gears 5’s software VRS technique. This is a bit more involved than thread rejection, so I will list some more extended code snippets to explain it better.

#define THREADGROUPSIZE (GI_THREADX*GI_THREADY)

[numthreads(GI_THREADX, GI_THREADY, 1)]
void CSMain(uint3 GroupThreadID : SV_GroupThreadID, uint3 DTid : SV_DispatchThreadID, uint3 GroupID : SV_GroupID, uint GroupThreadIndex : SV_GroupIndex)
{
	uint2 screenPos = DTid.xy;

	uint2 tileDims = uint2(SHADING_RATE_TILE_WIDTH, SHADING_RATE_TILE_HEIGHT);
	uint rate = shadingRate[screenPos.xy / tileDims].r;

	const uint WaveSize = WaveGetLaneCount();
	const uint GroupWaveIndex = GroupThreadIndex / WaveSize;
	uint TotalWaveCount = THREADGROUPSIZE / WaveSize;

	if (rate & SHADING_RATE_1X2)
		TotalWaveCount /= 2;

	if (rate & SHADING_RATE_2X1)
		TotalWaveCount /= 2;

	if (GroupWaveIndex >= TotalWaveCount)
	{
		return;
	}

The wave rejection logic is pretty simple: assuming a GI_THREADX*GI_THREADY threadgroup size, how many waves do I need according to this tile’s shading rate? A shading rate of 2×2 means that I only need a quarter of the waves, a rate of 1×2 means I only need half, etc. For example, with a 16×8 threadgroup (128 threads) and a wave size of 32, we start with 4 waves; a 1×2 or 2×1 rate halves that to 2, and a 2×2 rate (which sets both bits) brings it down to 1. Once we know how many waves the threadgroup actually needs to cover the current shading rate, if the current thread (based on GroupThreadIndex) belongs to a wave that is not needed, we simply reject it. The difference from the per-thread rejection approach is that this way we reject the whole wave the thread belongs to. We are essentially packing the active threads into a reduced number of waves, instead of keeping the rejected ones around, and returning the inactive waves to the pool to potentially be used for other work.

Packing the threads into a subset of the waves requires some thread index reordering to work correctly. The remapping below assumes a 16×8 thread group size.

	// origin of the current threadgroup in screen space
	screenPos = GroupID.xy * uint2(GI_THREADX, GI_THREADY);

	if (rate == SHADING_RATE_2X1) // X dim coarse
		screenPos += uint2(2, 1) * uint2(GroupThreadIndex % 8, GroupThreadIndex / 8);
	else if (rate == SHADING_RATE_1X2) // Y dim coarse
		screenPos += uint2(1, 2) * uint2(GroupThreadIndex % 16, GroupThreadIndex / 16);
	else if (rate == SHADING_RATE_2X2) // X and Y dim coarse
		screenPos += uint2(2, 2) * uint2(GroupThreadIndex % 8, GroupThreadIndex / 8);
	else
		screenPos = DTid.xy;

Using the new screenPos, we can continue to generate the ray (or sample it from a ray buffer) and trace as usual. At the end of the shader we just copy the result to neighbouring pixels as needed; this code is the same as in the thread rejection approach above.

	outputRT[screenPos.xy] = result; 

	// copy result to the other pixels according to the shading rate
	if ( rate & SHADING_RATE_1X2 )
		outputRT[screenPos.xy + uint2(0, 1)] = result;

	if (rate & SHADING_RATE_2X1)
		outputRT[screenPos.xy + uint2(1, 0)] = result;

	if (rate == SHADING_RATE_2X2)
		outputRT[screenPos.xy + uint2(1, 1)] = result;

I mentioned above that the thread group size is set to 16×8; this is because I am running this demo on an NVidia GPU with a wave size of 32 threads. This thread group size fits exactly 4 waves and gives us the ability to repack threads and reject up to 3 waves per group. For this to work I had to tie the raytracing thread group size to the shading rate image tile size and calculate the shading rate per 16×8 tile (SHADING_RATE_TILE_WIDTH, SHADING_RATE_TILE_HEIGHT). This makes the shading rate image coarser compared to the 8×8 one, with the same threshold.
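
Concretely, and assuming the macro names used in the snippets above, this setup corresponds to something like the following defines:

#define GI_THREADX 16
#define GI_THREADY 8

// Tie the shading rate tile to the raytracing threadgroup (16x8 = 128 threads = 4 waves of 32)
#define SHADING_RATE_TILE_WIDTH  GI_THREADX
#define SHADING_RATE_TILE_HEIGHT GI_THREADY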

Still, there is plenty of opportunity to lower the shading rate during raytracing. The per wave rejection version of VRS reduces the RTGI cost by 27.3% while the per thread rejection one reduces it by 28.1%, which is quite a similar gain (I reran the comparison with the 16×8 threadgroup in both cases; the thread rejection version now performs slightly worse because of the coarser shading rate image).

In terms of visuals, this is the wave-rejected GI image, very similar to the per thread rejected one and the original.

If per thread rejection performs similarly to, or even slightly better than, the per wave rejection approach, why would we be interested in the latter? The reason is that per wave rejection returns the unneeded waves to the pool and reduces the overall wave allocation on the GPU, as can be seen by comparing the two GPU traces in Nsight Graphics.

This brings the SM throughput down and allows more room to asynchronously overlap other work with the raytracing pass.

Both per thread and per wave rejection have a small impact on the GI noise; in the Gears 5 VRS presentation the authors mention adjusting the denoising pass based on the shading rate image, but I didn’t investigate this in this instance.

To wrap things up, this is another interesting form of thread binning, based on the similarity of the output instead of the input rays, that can help bring down the cost of techniques with large thread divergence.
