Low-level thinking in high-level shading languages 2023

“Low-level thinking in high-level shading languages” (Emil Persson, 2013), along with its followup “Low-level Shader Optimization for Next-Gen and DX11”, is in my top 3 most influential presentations, one that changed the way I think about shader programming in general (since I know you are wondering, the other 2 are Naty Hoffman’s Physically Based Shading and John Hable’s Uncharted 2 HDR Lighting). When I started graphics programming shaders were handcrafted in assembly, the HLSL compiler being in its infancy. It used to be the case that you could beat the compiler and manually produce superior shader assembly. This changed over the years: the compiler improved immensely and I learned to rely more on it and not pay much attention to, or think about, the produced assembly code.

This, and the followup, is a presentation that I recommend as required reading to people wanting to get deeper into shader programming, not just for the knowledge but also the attitude towards shader programming (check compiler output, never assume, always profile). It has been 10 years since it was released though; in those 10 years a lot of things have changed on the GPU/shader model/shader compiler front and not all the suggestions in those presentations are still valid. So I decided to do a refresh with a modern compiler and shader model to see what still holds true and what doesn’t. I will target the RDNA 2 GPU architecture on PC using HLSL, the 6.7 shader model and the DXC compiler (using https://godbolt.org/) in this blog post.

One of the big themes, if not the biggest, in the presentation was that the compiler, however good, can’t always optimise the code the way you might expect, and among the many examples of this is its inability to fold operations into multiply-add instructions even when it could. Starting with a simple 2-operation HLSL expression (which could clearly be turned into a single multiply-add with a simple refactoring)

result = (x + a) * b;

the compiler will return 2 separate instructions, an add and a multiply. It doesn’t rearrange the operations to use a single multiply-add instruction instead.

  v_add_f32     v0, s0, v0                             
  v_mul_f32     v1, s1, v0                             

Does it matter if we use literal constants instead?

result = (x + 1.2) * 3.2;

It doesn’t.

 v_add_f32     v0, lit(0x3f99999a), v0   
 v_mul_f32     v1, lit(0x404ccccd), v0 
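Doing the refactoring by hand gives the compiler a shape that maps directly to a multiply-add; a sketch of the manual rewrite (3.84 is simply 1.2 * 3.2 folded by hand), which I would expect to compile to a single fused multiply-add instruction:

result = x * 3.2 + 3.84; // same maths as (x + 1.2) * 3.2, modulo rounding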

The compiler won’t change the order of floating point operations in case this has an impact on the result (this document has more details on the why, here is an example of this in action). The shader author knows the problem they are trying to solve and is best placed to do the re-ordering manually. This comes up very commonly in shaders; to give you a real world scenario, the typical lighting formula for the diffuse component looks something like this

float3 diffuse = lightIntensity * lightColour.rgb * shadowFactor * (albedo.rgb/PI) * NdotL;

often mixes float3 and float operands, which the compiler won’t reorder, and this particular formulation produces 10 instructions. Simply rearranging the operands to group float and float3 operations

float3 diffuse = lightIntensity * shadowFactor * NdotL / PI * lightColour.rgb * albedo.rgb;

drops the number of instructions to 6.

This ties in well with another good practice during shader development, i.e. finish your derivations and implement features using the minimum number of operations needed, as the compiler won’t do it for you.

Instruction modifiers are still a great way to get a lot of operations for free. For example saturate(), x2, x4, /2, /4 on output and abs(), negate on input are free and can get a lot of extra mileage out of an instruction. For example this

saturate(4*(-x*abs(z)-abs(y)))

is just a single instruction

v_fma_f32     v1, -v2, abs(v0), -abs(v1) mul:4 clamp

The “on input” or “on output” part is important: it refers to applying the modifier to an individual operand or to the result of the whole operation respectively. For example using saturate on an input:

float x = saturate(a) + b;

will force an extra instruction and a copy to a register

v_max_f32     v1, s0, s0 clamp                       
v_add_f32     v1, v0, v1 
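For comparison, saturate applied to the whole result should fold into the add’s output clamp modifier for free; a sketch of what I would expect:

float x = saturate(a + b); // expected to become a single v_add_f32 with clamp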

exp() and log() are still implemented using exp2() and log2()

//exp()
v_mul_legacy_f32  v0, lit(0x3fb8aa3b), v0  // premultiply by 1.4427         
v_exp_f32     v1, v0

//log()
v_log_f32     v0, v0                         
v_mul_f32     v1, lit(0x3f317218), v0  // post multiply by 0.693147

pow(x,y) is calculated using exp2 and log2

  v_log_f32     v1, v2                                 
  v_mul_legacy_f32  v0, v0, v1                         
  v_exp_f32     v1, v0 

The compiler will calculate integer literal powers up to 4 (eg pow(x, 3)) using multiplies and for anything larger it will revert back to exp(log()). Any negative power value will also use exp(log()). Curiously, it will calculate pow(x,1) using exp(log()) as well, so it is worth avoiding it altogether.

//pow(x,1)
v_log_f32     v0, v0            
v_exp_f32     v1, v0  
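For larger integer powers it can be worth writing the multiplies yourself instead of paying for the exp2/log2 pair; a sketch for pow(x, 8) using repeated squaring:

// pow(x, 8) with 3 multiplies instead of log2()/exp2()
float x2 = x * x;
float x4 = x2 * x2;
float x8 = x4 * x4;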

1/sqrt(x) now yields a “reciprocal square root” instruction.

// 1/sqrt(x)
v_rsq_f32     v1, v0  

sign() and inline conditionals are quite similar in terms of instructions produced (sign() will also handle the case where x==0)

  //sign(x)
  v_cmp_gt_f32  vcc, 0, v0                               
  v_cndmask_b32  v1, 1.0, -1.0, vcc                 
  v_cmp_neq_f32  vcc, 0, v0                             
  v_cndmask_b32  v0, 0, v1, vcc                         
  v_trunc_f32   v1, v0   

  //x >=0?1:-1
  v_cmp_le_f32  vcc, 0, v0                              
  v_cndmask_b32  v1, -1.0, 1.0, vcc             
  v_mov_b32     v0, lit(0x00003c00)            
  v_cndmask_b32  v1, lit(0x0000bc00), v0, vcc

but the latter has an advantage when we don’t care about x==0 and also when used in operations, such as sign(x) * y

  //sign(x) * y
  v_cmp_gt_f32  vcc, 0, v1 
  v_cndmask_b32  v2, 1.0, -1.0, vcc 
  v_cmp_neq_f32  vcc, 0, v1             
  v_cndmask_b32  v1, 0, v2, vcc 
  v_trunc_f32   v1, v1                 
  v_mul_f32     v1, v0, v1       

  // x>=0 ? y:-y
  v_cmp_le_f32  vcc, 0, v1                      
  v_cndmask_b32  v1, lit(0x80000000), 0, vcc           
  v_xor_b32     v1, v0, v1 

Inverse trigonometric functions are still not native and continue to be bad. This is acos() for example:

  v_mov_b32     v1, lit(0x3be3b0b4)                      
  s_waitcnt     vmcnt(0)                                
  v_fma_legacy_f32  v1, lit(0xbaac860d), abs(v0), v1     
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0xbc90489a)     
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0x3d0070e2)     
  s_denorm_mode  0x000f                                 
  v_add_f32     v2, -abs(v0), 1.0                        
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0xbd4e589e)     
  v_sqrt_f32    v3, v2                                   
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0x3db64f94) 
  v_cmp_neq_f32  vcc, 0, v2                              
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0xbe5bc07d)   
  v_fma_legacy_f32  v1, v1, abs(v0), lit(0x3fc90fdb)     
  v_cndmask_b32  v2, 0, v3, vcc                          
  v_mul_legacy_f32  v1, v1, v2                           
  v_sub_f32     v2, lit(0x40490fdb), v1                 
  v_cmp_gt_f32  vcc, 0, v0                               
  v_cndmask_b32  v1, v1, v2, vcc  

It is worth avoiding inverse trigonometric functions altogether if possible. If you absolutely need to use them in shaders, it is preferable to evaluate approximations instead, but always profile to see if there is actually any perf gain.

Even native instructions like sin()/cos()/log()/sqrt() as well as rcp() can come at a cost as they are quarter rate on RDNA (the GPU can issue one every 4 clocks), something to keep in mind as it can increase the instruction latency.

Rolling your own matrix/vector multiply function can still save you 3 instructions; the compiler won’t treat a .w of “1.0” any differently.

  //mul(float4(a, 1.0), m)
  v_mul_f32     v3, s4, v0                        
  v_mul_f32     v4, s8, v0                          
  v_mul_f32     v0, s0, v0                         
  v_fmac_f32    v3, s5, v1                         
  v_fmac_f32    v4, s9, v1                       
  v_fmac_f32    v0, s1, v1                        
  v_fmac_f32    v3, s6, v2                       
  v_fmac_f32    v4, s10, v2                       
  v_fmac_f32    v0, s2, v2                         
  v_fma_mixlo_f16  v1, s7, 1.0, v3                       
  v_fma_mixlo_f16  v0, s3, 1.0, v0                     
  v_fma_mixhi_f16  v1, s11, 1.0, v4        

  //a.x * m[0] + (a.y * m[1] + (a.z * m[2] + m[3]))
  v_fma_f32     v3, v2, s6, s7                          
  v_fma_f32     v4, v2, s10, s11                      
  v_fma_f32     v2, v2, s2, s3                          
  v_fmac_f32    v3, s5, v1                              
  v_fmac_f32    v4, s9, v1                              
  v_fmac_f32    v2, s1, v1                              
  v_fma_mixlo_f16  v1, v0, s4, v3                  
  v_fma_mixhi_f16  v1, v0, s8, v4                  
  v_fma_mixlo_f16  v0, v0, s0, v2        
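In HLSL the hand-rolled version is simply the expression from the comment above wrapped in a small helper; a sketch, assuming the same mul(float4(a, 1.0), m) row-vector convention as in the listing:

// Transform a point, skipping the redundant multiplies by w = 1.0
float3 TransformPoint(float3 a, float4x4 m)
{
    return (a.x * m[0] + (a.y * m[1] + (a.z * m[2] + m[3]))).xyz;
}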

In terms of sharing subexpressions between some functions it is still true that the compiler will reuse parts of length() and distance()

// a = length(x-y)
// b = distance(x,y)
  v_subrev_f32  v1, v2, v5                               
  v_subrev_f32  v2, v3, v6                               
  v_mul_legacy_f32  v0, v1, v1                         
  v_subrev_f32  v1, v4, v7                               
  v_fmac_legacy_f32  v0, v2, v2                       
  v_fmac_legacy_f32  v0, v1, v1                        
  v_sqrt_f32    v1, v0             

but only if the order matches (eg, it won’t for distance(y, x)). This is not true for normalize() and length() though; they both seem to do their own calculations, instead of using the output of length() for the normalisation, in all likelihood due to floating point precision differences between the reciprocal square root instruction and a true 1/sqrt() (although, as we saw above, the compiler will happily calculate 1/sqrt() with the reciprocal square root instruction).

 //a = length(x)
 //b = normalize(x)
  v_mul_legacy_f32  v3, v0, v0                       
  v_fma_f32     v4, v1, v1, v3                          
  v_fmac_legacy_f32  v3, v1, v1                     
  v_fmac_f32    v4, v2, v2                              
  v_fmac_legacy_f32  v3, v2, v2                    
  v_rsq_f32     v4, v4                                  
  v_sqrt_f32    v3, v3                                  
  v_mul_legacy_f32  v0, v0, v4                   
  v_mul_legacy_f32  v1, v1, v4                   
  v_mul_legacy_f32  v2, v2, v4      
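If that small precision difference doesn’t matter for your use case you can share the work manually; a sketch:

// Share the squared length between the length and the normalisation
float lenSq = dot(v, v);
float len = sqrt(lenSq);
float3 nrm = v * rsqrt(lenSq);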

In HLSL the [loop] directive is now even more of a hint to the compiler to keep a loop in the code, and the compiler seems to mostly ignore it if the loop count is known at compile time.

    [loop]
    for(int i = 0; i < N; i++)
    {
        float3 v = tex.SampleLevel(state, input.uv + i,0).rgb;
        result.xyz += mul(float4(v.rgb, 1), m).xyz;
    }

The compiler will fully unroll the loop for N up to ~ 50 and then partially unroll it for greater values. This value might change if the VGPR allocation in the shader is higher. The compiler will keep the loop with an unknown loop count though, with or without the [loop] directive.

It used to throw an error when one tried to force a [loop] or a [branch] while sampling a texture with uv coordinates calculated inside the branch without specifying a mip level (i.e. without using SampleLevel()), as in the example below

[branch]
if ( data.x > 0 )
{
    float2 uv =  input.uv * data.x;
    result.xyz = tex.Sample(state, uv).rgb;
}
 
  v_cmpx_ngt_f32  exec, v1, 0                            
  v_mov_b32     v3, 0                                   
  v_mov_b32     v4, 0                                   
  v_mov_b32     v5, 0                                   
  s_cbranch_execz  label_006C                            
label_006C:
  s_andn2_b64   exec, s[2:3], exec  // mark active threads that passed the test in the execution mask                     
  s_cbranch_execz  label_00A8                           
  s_mov_b64     vcc, exec                                
  s_wqm_b64     exec, vcc   // Force all pixels in a quad (that contains a pixel that passed the test) ON                    
  v_mul_f32     v2, v2, v1     // Calculate uv coordinates for all pixels in the quad                 
  v_mul_f32     v0, v0, v1                               
  s_mov_b64    exec, vcc     // Now activate the pixels in the quad/wave that have actually passed the test.                    
  s_load_dwordx8  s[4:11], s[0:1], null                  
  s_load_dwordx4  s[12:15], s[0:1], 0x000040             
  s_waitcnt     lgkmcnt(0)                      
  // Sample texture only for pixels that have passed the test but with always correct mip level.         
  image_sample  v[3:5], [v2,v0], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D  
label_00A8:
  s_mov_b64     exec, s[2:3]                             

The problem with this code is that for the threads/pixels that won’t follow the branch the uv coordinates are undefined, so the shader can’t calculate the uv derivatives needed to select the mip level. Instead of failing, the compiler does something interesting: it forces all the pixels in the quad on, makes them calculate the uv coordinates regardless of whether they will follow the branch, and then takes the branch as needed. The bottom line is that [loop] and [branch] might not have the result you expect and it is worth inspecting the final code to determine what they do.
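If you would rather not rely on this behaviour you can sidestep the derivative issue yourself, either by moving the uv calculation into uniform control flow or by sampling an explicit mip level; a sketch of both options:

// Option 1: calculate the uv outside the branch so the derivatives are well defined
float2 uv = input.uv * data.x;
[branch]
if ( data.x > 0 )
{
    result.xyz = tex.Sample(state, uv).rgb;
}

// Option 2: sample an explicit mip level, no derivatives needed
[branch]
if ( data.x > 0 )
{
    result.xyz = tex.SampleLevel(state, input.uv * data.x, 0).rgb;
}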

Unfortunately integer division is still really bad. I won’t fill the post with shader ISA but a simple A/B division, both scalar integers, will emit around 34 instructions. Unsigned ints are slightly better (but still expensive) at 24 instructions. Regarding the mul24() mentioned in the presentation for faster integer multiplication, it doesn’t seem to be supported any more. You could instead use the new uint16_t/int16_t data types (with -enable-16bit-types) to speed up integer operations (division is still bad though).
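If the divisor happens to be a known power of two, a shift and a mask sidestep the whole division sequence; a sketch:

// value / 16 and value % 16 without the long integer division sequence
uint quotient = value >> 4;
uint remainder = value & 15u;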

Cubemap sampling still has an overhead in terms of ALU to calculate the appropriate texture coordinates

//cubeTex.Sample(state, coords.xyz).xyz;
  
  v_cubema_f32  v3, v0, v1, v2                          
  v_rcp_f32     v3, abs(v3)                              
  v_cubetc_f32  v4, v0, v1, v2                          
  v_cubesc_f32  v5, v0, v1, v2                      
  v_cubeid_f32  v0, v0, v1, v2                       
  v_fmaak_f32   v1, v4, v3, lit(0x3fc00000)    
  v_fmaak_f32   v2, v5, v3, lit(0x3fc00000)     
  s_and_b64     exec, exec, s[20:21]                
  image_sample  v[0:2], [v2,v1,v0], s[12:19], s[0:3] dmask:0x7 dim:SQ_RSRC_IMG_CUBE 

Register indexing is supported (eg for register arrays) but is another example of why one should always inspect the output ISA to understand what the compiler is doing. Indexing the components of a vector using an index unknown to the compiler (eg coming from a constant buffer) is not supported and will emit this code which checks every possible index value.

 float4 result = tex.Sample(state, input.uv);
 return result[index]; // index from constant buffer

  s_cmp_eq_i32  s0, 2                                
  s_cselect_b64  s[2:3], s[16:17], 0              
  s_cmp_eq_i32  s0, 1                                  
  s_cselect_b64  s[4:5], s[16:17], 0                  
  s_cmp_eq_i32  s0, 0                  
  s_cselect_b64  vcc, s[16:17], 0                      
  v_cndmask_b32  v2, v3, v2, s[2:3]                   
  v_cndmask_b32  v1, v2, v1, s[4:5]                   
  v_cndmask_b32  v1, v1, v0, vcc    

Changing it slightly to index into an array of floats instead will use proper register indexing:

 float result[4];
 return result[index]; // index from constant buffer

 s_cmp_lt_u32  s0, 4                                   
 s_cbranch_scc0  label_0084                           
 s_mov_b32     m0, s0                                                          
 v_movrels_b32  v4, v0  // use v0 as an index into the register array starting at v4.       

About indexing in general, for both register arrays and textures/buffers, another thing worth mentioning is that every time we use a floating point value as an index it will force a conversion to int (v_cvt_u32_f32) and potentially a register allocation. It is preferable to pass indices through buffers in integer formats to begin with.
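A sketch of the difference (indexBuffer, indexBufferUint and tid are hypothetical names for this example):

// Index stored as a float: forces a v_cvt_u32_f32 before it can be used
float fIndex = indexBuffer[tid];
result = values[(uint)fIndex];

// Index stored as an integer to begin with: no conversion needed
uint iIndex = indexBufferUint[tid];
result = values[iIndex];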

Scalar vs vector operations, as mentioned in the presentation, are worth briefly discussing: there they are considered mainly from the point of view of single floats vs vectors of floats (i.e. float4). Under the hood all such instructions are scalar, i.e. they work on a single float (or int) value; you can see it with a simple matrix-vector multiplication for example.

result.xyz = mul(float4(vec.xyz, 1), m).xyz;

v_fma_f32     v3, v0, s4, s7        // Each register vN holds a single float value    
v_fma_f32     v4, v0, s8, s11           
v_fma_f32     v0, v0, s12, s15          
v_fmac_f32    v3, s5, v1                
v_fmac_f32    v4, s9, v1                
v_fmac_f32    v0, s13, v1               
v_fmac_f32    v3, s6, v2                
v_fmac_f32    v4, s10, v2               
v_fmac_f32    v0, s14, v2               
v_add_f32     v1, 1.0, v3     

What is more interesting is where those floats are stored in the above example, and that is in VGPRs (Vector General Purpose Registers, the “v” registers above) and SGPRs (Scalar General Purpose Registers, the “s” registers above). A single VGPR holds one float value for each thread of the wave, while a single SGPR holds one float value common to and accessible by all threads in a wave.

With a VGPR, a thread can only access the float value that corresponds to its index, eg thread 0 can’t access the second float in VGPR0 (this is not strictly true, more on this later). SGPRs typically hold data common to all threads, like the matrix m above, originating from a constant buffer (more on this here). To cut a long story short, SGPR instructions (starting with s_ in the assembly code above), registers and cache have dedicated units on RDNA and GCN GPUs and are a great place to offload work common to all threads, reducing pressure on the main vector pipeline (instructions starting with v_ in the code above). This is again something the shader author should be aware of and work with the compiler to achieve. For example a read from a constant buffer will be a scalar load and the result will be stored in SGPRs. If the constant buffer contains an array, on the other hand, it depends on whether the index is common to all threads (and the compiler can prove that statically, eg it comes from a constant buffer or a literal) or can vary per thread:

cbuffer Data
{
    float4 values[30];
};

result = values[i]; // index is provably the same for all threads (eg a literal or from a constant buffer)
s_buffer_load_dwordx4  s[0:3], s[0:3], s4 // scalar load

result =  values[i]; // index changes per thread, or the compiler can't infer if the same for all threads
tbuffer_load_format_xyzw  v[0:3], v0, s[0:3], 0 offen format:[BUF_FMT_32_32_32_32_FLOAT]   // vector load

In case the compiler knows that the index is constant across all the threads in the wave it will take the fast path and do a scalar load, else it will issue a vector load. The same goes for a StructuredBuffer: the compiler can issue either a scalar or a vector load depending on the index. This is not true for a typed buffer though; the compiler will issue a vector load in both cases. A quick note that an array in a constant buffer will be performant for sequential access, but random access may come with a performance penalty on some platforms; better to use a different type of buffer if that is your use case.

Finally, the scalar unit doesn’t support all types of Maths operations.

int result = a * b; // a,b integers
s_mul_i32     s0, s0, s1 // operation can be performed on the scalar unit fully. Same for addition.

float result = a * b; //a, b float
v_mul_f32     v0, s0, s1 // not supported on the scalar unit, a vector instruction has to be used even if the sources are the same for all threads.

Integer division on the scalar unit is even worse than on the vector unit, at ~46 instructions, and it involves a mix of scalar and vector operations and conversions between them; one to avoid.

In general I found that most of the things mentioned in the follow-up presentation (Low-level Shader Optimization for Next-Gen and DX11) still hold today as well, so I’d like to focus next on what has changed since.

GPUs have evolved massively since and have become very wide (i.e. ALU capacity has increased a lot), but memory latency/bandwidth and fixed function units haven’t scaled by the same amount. Some back of the napkin Maths, using stats from https://www.techpowerup.com/gpu-specs/ for example, shows that the ALU (GFlops/sec) to memory bandwidth (GB/sec) ratio has increased from 10 on the AMD Radeon 4870 (the GPU referenced in the original presentation) to 76 on a recent Radeon RX 7600, and the ALU to texture rate (GTex/sec) ratio from 40 to 65. This means that rendering passes that rely heavily on fixed functionality will struggle even more to fill the GPU with work.

With the arrival of DirectX 12 graphics programmers have more control over how they set up and feed data to the shaders, but this also means that there are more things peripheral to shader authoring that can affect its performance. For example, where the followup presentation on DX11 talks about descriptors and binding resources to shaders, DirectX 12 makes the interaction with shaders and how we feed data to them more explicit, with root signatures and a choice of root parameter types. The most efficient way to pass a piece of data to the shader is directly in the root signature, the most expensive is via a double indirection with a descriptor table. Things can get even more expensive if, due to a high number of root parameters, part of the root signature has to be stored in memory instead of SGPRs. It is worth spending some time studying the different ways, the pros/cons of each and how they affect performance. Barriers and resource transitions can also affect the execution of a drawcall/dispatch and can hinder parallelisation of work on the GPU, so they now need extra consideration.

GPUs will overlap shader work from different drawcalls/dispatches if there are no dependencies, but we now have ways to make that overlap more explicit with Async Compute. This is great for complementary workloads that are bottlenecked by different resources/fixed units; for example a shadowmap rendering pass will likely be limited by vertex rate whereas an SSAO pass mainly by ALU (and maybe texture reads), and the two could be overlapped to make use of the otherwise idle GPU resources. Worth bearing in mind that the workload assigned to the compute queue should be sizeable (avoid small dispatches) and also that the same task running async on the compute queue will take longer than when run on the graphics queue. Your mileage may vary with Async Compute based on the type of workload and platform; as always, profiling will be needed to determine the actual benefits.

Material and lighting model complexity have increased a lot since, increasing both ALU and texture reads in a shader. Divergence in shaders has also increased a lot: screen space lighting techniques (SSAO, SSR, raymarched shadows) and more recently raytracing have reduced execution and data coherence among wave threads. There is a plethora of ways to address this that are worth looking into, for example:

  • Tile classification: instead of a single complex shader with all possible functionality behind dynamic branches, classify image tiles based on the required shader functionality and create a set of simpler shaders to implement it. This brings down the shader complexity and also the VGPR allocation, which may improve occupancy and memory latency hiding (discussed below).
  • Binning: implemented in the context of raytracing but with wider applicability, the idea behind this is to reorder the thread indices in a wave to bring similar input data (in that context, rays that point towards the same general direction) closer together, to reduce divergence and increase cache coherence.
  • Variable rate shading: this technique does not affect the shader functionality per se, but the rate at which the shader is executed, based on the similarity of the output, i.e. a shader can be executed per pixel, every 2 pixels with the result shared, every 4 pixels etc. It can be either hardware based, for pixel shaders, or software based, for compute shaders.

Memory latency can affect shader performance significantly, especially with the increased shader complexity discussed above and the increased amount of texture read instructions. Occupancy can be important to shader performance, as it is an indication (but not the only one) of how well the GPU can hide memory latency. In short, it refers to the number of waves the SIMD can have in flight (warmed up and ready to run) in case a wave/instruction gets blocked by a memory read. It is mainly affected by the number of vector registers (VGPRs) allocated by the shader. This is an area that can potentially be improved by following the advice in the original presentation and simplifying the ALU work. Occupancy is not the only measure of how well memory latency can be hidden though. The compiler will also try to put as much space (i.e. as many instructions) as possible between the instruction that issues the memory read and the instruction that uses the result.

float3 vec = tex.Sample(state, input.uv).rgb; // issue a texture read
result.xyz += mul(float4(data.xyz, 1), m).xyz;
result.xyz += mul(float4(vec.xyz, 1), m).xyz; // use the result

image_sample  v[0:2], [v2,v0], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D // issue memory read
s_buffer_load_dwordx8  s[4:11], s[0:3], 0x000010      
s_buffer_load_dwordx8  s[12:19], s[0:3], 0x000030     
s_waitcnt     lgkmcnt(0)                              
v_mov_b32     v3, s11                                 
v_mov_b32     v4, s15                                 
v_mov_b32     v5, s19                                 
v_fma_f32     v3, s4, s8, v3                          
v_fma_f32     v4, s4, s12, v4                         
v_fma_f32     v5, s4, s16, v5                         
v_fma_f32     v3, s9, s5, v3                          
v_fma_f32     v4, s13, s5, v4                         
v_fma_f32     v5, s17, s5, v5                         
v_fma_f32     v3, s10, s6, v3                         
v_fma_f32     v4, s14, s6, v4                         
v_fma_f32     v5, s18, s6, v5                         
s_waitcnt     vmcnt(0)            // block until data has arrived                    
v_fma_f32     v6, v0, s8, s11   // use the result                      
// rest of instructions

In the above example the compiler can insert a lot of instructions between the memory read and the point the data is needed, so the impact of memory latency is reduced. There are things we can do to help it, such as partially unrolling loops to give it more instructions to work with. We may also need to manually rearrange instructions to give the compiler more opportunities to hide some latency. In the example above, if I switch the order of the two operations like so

float3 vec = tex.Sample(state, input.uv).rgb; // issue a memory read
result.xyz += mul(float4(vec.xyz, 1), m).xyz;  // use the data first
result.xyz += mul(float4(data.xyz, 1), m).xyz;

image_sample  v[0:2], [v2,v0], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D // issue read
s_buffer_load_dwordx8  s[4:11], s[0:3], 0x000030      
s_buffer_load_dwordx8  s[12:19], s[0:3], 0x000010     
s_waitcnt     lgkmcnt(0)                              
v_mov_b32     v4, s11                                 
v_fma_f32     v4, s8, s12, v4                         
v_fma_f32     v4, s9, s13, v4                         
v_fma_f32     v4, s10, s14, v4                        
s_waitcnt     vmcnt(0)          // block until the data is ready                  
v_fma_f32     v5, v0, s16, s19    // we need to use the data now                    
v_fma_f32     v3, v0, s8, s11                         
v_fma_f32     v0, v0, s4, s7                          
v_fmac_f32    v5, s17, v1                             
v_fmac_f32    v3, s9, v1                              
v_fmac_f32    v0, s5, v1                              
v_mov_b32     v1, s7                                  
v_fmac_f32    v3, s10, v2                             
v_fmac_f32    v5, s18, v2                             
v_fma_f32     v1, s4, s12, v1                         
v_add_f32     v3, 1.0, v3                             
v_fmac_f32    v0, s6, v2                              
v_fma_f32     v1, s5, s13, v1                         
v_add_f32     v3, v3, v4                              

now there are far fewer instructions to insert between the memory read and the blocking s_waitcnt, because the compiler won’t change the order of the operations in case it affects the result (as discussed above); this is a decision the shader author will have to make. There can be side-effects though: increased occupancy may lead to cache thrashing, as a large number of waves compete for cache access. Additionally, once, when we removed a long inactive branch in a shader to reduce VGPR allocation and improve occupancy, the shader cost went up, not down, because the compiler had fewer VGPRs to cache the results of the memory reads in: it had to issue a read, get the result, use it, and free up the VGPR to reuse for another memory read, effectively serialising them. I can’t stress enough the need to profile any change you make; it is likely that the result defies expectation.

Another new tool in the shader author’s toolbox is wave intrinsics, which allow the threads within a wave to talk to each other and exchange data. Wave intrinsics have many uses, which revolve around using VGPRs (the fastest form of storage) to store and share intermediate data instead of groupshared or plain memory, and around making decisions about the state of each thread in a wave and performing collective actions based on that. I have discussed, for example, how they can be used to implement stream compaction in a previous post, using a per-wave atomic instead of one per thread. We discussed earlier how the compiler can decide whether to use the scalar pipeline (particularly for loads) if it can infer that the index is wave invariant. This mainly covers indices from constant buffers and literals, but wave intrinsics offer new ways to establish this at runtime, even for data sources whose uniformity we have no prior knowledge of.

int i =  ... // some index value that may or may not change per thread 

int i0 = WaveReadLaneFirst(i); // get the index's value for the first thread in the wave

if ( WaveActiveAllTrue( i == i0 ) ) // if all indices are the same as the first one issue a scalar load
{
	result = structured_buffer[i0];     
}
else // else issue a vector load
{
	result = structured_buffer[i];   
}  

  v_readfirstlane_b32  s0, v0 // get index for first thread                        
  v_cmp_eq_i32  vcc, s0, v0  // compare to indices from all threads                           
  s_mov_b32     s2, s3                                  
  s_mov_b32     s3, s1                                  
  s_load_dwordx4  s[12:15], s[2:3], 0x00                
  s_cmp_eq_u64  exec, vcc                               
  s_cbranch_scc1  label_000F                          
  s_waitcnt     lgkmcnt(0)                              
  tbuffer_load_format_x  v2, v0, s[12:15], 0 idxen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]  // if not all the same, do a vector load 
  s_branch      label_0015                             
label_000F:
  s_lshl_b32    s0, s0, 2                              
  s_waitcnt     lgkmcnt(0)                             
  s_buffer_load_dword  s0, s[12:15], s0 // if all indices are the same do a scalar load.            
  s_waitcnt     lgkmcnt(0)      

This idea can be expanded into more complex scenarios and is called scalarisation; it involves having cheaper and more expensive paths in the shader and deciding which to follow at runtime, using wave intrinsics to determine data variance between the threads in a wave.
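A common generalisation of the snippet above is the “waterfall” loop: peel off one wave-uniform value per iteration with WaveReadLaneFirst() and let the matching threads take the now-scalar path. A sketch, not production code, reusing the names from the snippet above:

for (;;)
{
    // the index of the first active lane becomes a wave-uniform (scalar) value
    uint i0 = WaveReadLaneFirst(i);
    [branch]
    if ( i == i0 )
    {
        // every lane taking this path shares the same index, so the compiler
        // is free to use a scalar load here
        result = structured_buffer[i0];
        break; // matching lanes drop out, the loop repeats for the remaining ones
    }
}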

The last new feature I would like to briefly talk about is support for 16 bit floating point numbers (fp16). On paper, a great feature of the fp16 representation is that it allows packing two 16 bit fp numbers into a single 32 bit register, reducing the VGPR allocation for a shader and increasing occupancy, and also allows a reduction in ALU instruction count by performing instructions on packed 32 bit registers directly (i.e. affecting the two packed fp16 numbers independently). Your mileage will vary a lot with fp16; it needs quite a bit of planning across all stages of the pipeline, resources and shaders to ensure that all data and operations remain in fp16, else the shader will be spending time converting between fp16 and fp32.
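As a minimal illustration (assuming the shader is compiled with -enable-16bit-types and shader model 6.2 or higher, and that x, y, z, w are values already in fp16):

// Two fp16 values can share a single 32 bit VGPR, and the multiply-add below
// can map to a packed instruction operating on both halves at once.
float16_t2 a = float16_t2(x, y);
float16_t2 b = float16_t2(z, w);
float16_t2 r = a * b + float16_t2(1.0, 1.0);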

I haven’t covered all recent developments that may affect shader authoring but the post is getting a bit long. I have collected some more good practices during shader development if you are interested.

To try to summarise the original presentation’s point, as I understood it, and this post’s: it is less about saving the odd ALU instruction and more about understanding the tools/platform you are working with and working with the compiler instead of blindly relying on it.

  • Compiler technology has evolved significantly since the early days but the compiler can’t know the author’s intent. We need to work with the compiler to achieve the best code and performance result.
  • Don’t make assumptions about what the compiler will do, instead learn to read and understand the compiler’s output where you have the opportunity to do so.
  • You don’t always have to worry about the extra instructions. This is more about developing good habits, eg batching operations by type, avoiding integer division and too much inverse trigonometry etc.
  • What is more important is understanding the bottlenecks in each case and making sure that GPU resources are not wasted (ALU, bandwidth etc).
  • Always profile to see the impact of any shader change/performance improvement, the result may surprise you.
  • Look at the shader execution in context, there may be other things outside it that affect its performance.



A gentler introduction to ReSTIR

Recently I started exploring ReSTIR, using mainly the Gentle Introduction to ReSTIR Siggraph course and the original paper. I began with direct illumination (ReSTIR DI), to quickly set it up and get something working. ReSTIR is a very interesting technique that gives great results, but there is a lot of Maths behind it that might dissuade people who want to dip their toes in it, which is a shame. Resources like the Gentle Introduction help a lot towards clarifying some of the theory behind it but it is still Maths heavy. In this post I will be attempting a more “qualitative” discussion of ReSTIR, going straight to the results and avoiding referencing the Maths behind it too much.

Let’s consider one of the still hard to solve problems in real-time graphics, how to light a scene with a very large number of shadowed lights. For example this is Sponza with 400 (unshadowed) point lights.

The scene looks wrong and unnatural without shadows, with light leaking through walls and pillars. What we would really like instead is this, in which every light is correctly shadowed:

Calculating shadows for a very large number of lights, either through shadowmaps or raytracing, can be very expensive though, in both memory and performance, and for this reason, in real time graphics, we tend to keep the number of shadow casting lights low. There is also the issue of light sources that are not really point lights, for example emissive surfaces/area lights, for which it is even harder to calculate shadows. If we can’t raytrace shadows for a large number of lights, what can we do with the typically low per pixel ray budgets? It turns out quite a lot.

Enter ReSTIR. Key to ReSTIR is the assumption that we shouldn’t need to process many lights in the scene per pixel, but only a small subset of randomly chosen ones, and from this subset select only one as the “most important” (important in this context meaning it has the biggest influence on the surface) representative, which we store in a structure called the “Reservoir”. At the heart of ReSTIR is a technique called Weighted Reservoir Sampling (WRS). ReSTIR builds on this technique to add reservoir reuse across time and space.

Setting up Weighted Reservoir Sampling is relatively straightforward (pseudocode from this paper).

As mentioned, we store a Reservoir per pixel; each Reservoir holds the light index Y of the most influential light for this pixel and its weight W_y.

struct Reservoir
{
    uint Y; // index of most important light
    float W_y; // light weight
    float W_sum; // sum of all weights for all lights processed
    float M; // number of lights processed for this reservoir
};

bool UpdateReservoir(inout Reservoir reservoir, uint X, float w, float c, inout RngStateType rngState)
{
    reservoir.W_sum += w;
    reservoir.M += c;

    if ( rand01(rngState) < w / reservoir.W_sum  )
    {
        reservoir.Y = X;
        return true;
    }

    return false;
}

rand01() is a method that returns a uniformly sampled random number between 0 and 1. This is used to randomly select a light, based on its weight w. Worth noting that the larger the weight (importance) of a light the more likely it is to be selected.

With this at hand we can go ahead and implement weighted reservoir sampling (pseudocode from the same paper):

Since I promised that this is a mostly qualitative introduction without Maths, I did a quick annotation of the pseudocode to explain what each term is. In code it would look something like this:

    float pdf = 1.0 / N;
    float p_hat = 0;
    
    //initial selection of 1 light of M
    for (uint i = 0; i < M; i++)
    {
        uint lightIndex = uint(rand01(rngState) * (NoofPoint - 1));
                      
        p_hat = length(GetPointLightRadiance(LightsBuffer[lightIndex], worldPos, CameraPos.xyz, surfaceData));
          
        float w = p_hat / pdf;
        
        UpdateReservoir(reservoir, lightIndex, w, 1, rngState);
    }

    if (IsReservoirValid(reservoir))
    {
        PointLightData lightData = LightsBuffer[GetReservoirLightIndex(reservoir)];

        RayDesc ray;
        ray.Origin = worldPos.xyz;
        ray.TMin = 0.05;
    
        ray.TMax = length(lightData.Position.xyz - worldPos.xyz);
        ray.Direction = normalize(lightData.Position.xyz - worldPos.xyz);
    
        float shadowFactor =  FindHit(Scene, ray); // is this ray occluded?
                    
        //pixel radiance with the selected light
        float3 radiance = shadowFactor *   GetPointLightRadiance(LightsBuffer[GetReservoirLightIndex(reservoir)], worldPos, CameraPos.xyz, surfaceData);

        p_hat = length(radiance );
            
        // calculate the weight of this light
        reservoir.W_y = p_hat > 0.0 ? rcp(p_hat) * reservoir.W_sum / reservoir.M : 0.0;
       
        // apply it to the radiance to get the final radiance.
        radiance  *=  reservoir.W_y;
    }

Let’s consider the “simple” case of lighting the scene with only N point lights, 400 of them in this instance. Also let’s assume that we will randomly pick M=32 of them for each pixel. Since we are using the uniform distribution to randomly select lights (i.e. each light has the same probability of being selected) from a total of N, the pdf p(x) is 1/N. Worth noting that we select 32 lights knowing nothing about them: how far away they are, whether they are fully occluded or not; we are even unaware if we are selecting the same light multiple times.

The “p_hat” quantity used in the light weight is more interesting. This is the measure of how “important” the light is for this pixel, and to approximate it we use the length of the output radiance at the pixel (i.e. the output of the brdf). Above, I have included both the diffuse and specular response in the GetPointLightRadiance() that feeds into p_hat. It is also important to add light attenuation and ideally visibility to p_hat (more on this later). How does the p_hat quantity represent the “importance” of the light? If a blue light shines on a red albedo surface, for example, the output will be dark and the weight (importance) of the light reduced. Or if the light is far away, attenuation will reduce its intensity, scaling down the brdf response.

I mentioned that, ideally, we would like to add light visibility (shadows) to p_hat as well, but it might not be possible for all M lights due to cost. Instead we calculate the light weight without visibility during WRS and then apply shadows once to the selected and (we hope) most important light. This is not always accurate though as the light response might be strong on the surface but it might ultimately be occluded. This is a source of noise as we will discuss later.

Finally we calculate the weight reservoir.W_y for the selected light and apply it to the radiance to get the final radiance at the pixel. If the light is shadowed, the radiance will be zero.

A quick warning here: if you usually store your lights in a constant buffer, a common approach with current lighting techniques, you may notice a severe performance degradation in this context of random selection and access of lights. In this case it is better to store the lights in a StructuredBuffer instead, which copes with random access better.
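A sketch of what the change looks like on the shader side (the register assignment and MAX_LIGHTS are just placeholders for this example):

// Before: lights in a constant buffer
// cbuffer Lights { PointLightData Lights[MAX_LIGHTS]; };

// After: lights in a structured buffer, better suited to divergent indexing
StructuredBuffer<PointLightData> LightsBuffer : register(t0);

PointLightData lightData = LightsBuffer[lightIndex]; // index varies per thread/pixel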

Like any stochastic technique, ReSTIR can suffer from noise and bias (this is where good understanding of the Maths behind ReSTIR becomes important). Noise is easy to understand. Bias, in practical terms, refers most commonly to a difference in intensity between the reconstructed and the real (ground truth) image. Attempts to denoise the image, for example, can lead to bias.

So, before we begin, it is worth having a reference image to compare to. This one was produced by averaging the results of the weighted reservoir sampling discussed above without using any denoising at all, so it should be pretty close to ground truth.

To begin with, this is the output of weighted reservoir sampling discussed above, selecting 32 lights out of 400. I went ahead and added light visibility to the p_hat during the initial selection to see the effect.

The result is noisy but nothing a good denoiser can’t fix. The bigger issue here is the cost, tracing 32 shadow rays per pixel. Alternatively we can remove light visibility from the initial selection and add it only to the final, selected light. This is the result, tracing one shadow ray per pixel.

First of all, a quick observation: being able to render so many shadowcasting lights using one shadow ray per pixel is very impressive! Getting more into the details, the image is noticeably noisier, and we have also lost some detail in the darker areas compared to the reference and the previous result (compare the bottom right of the images for example). This is, as discussed, the result of not including the light visibility in p_hat during the initial selection. To visualise this, white pixels in the following image have selected a valid light which was ultimately shadowed (apologies for the poor choice of colour, it is hard to find one that stands out in such a multicoloured lit scene).

In the mentioned bottom right corner, almost all selected lights are ultimately shadowed contributing no lighting in the area.

Also, since the whole process is stochastic, pixels can end up selecting no light at all. To showcase this, in the following image white pixels have received no valid light index and receive no light, another source of noise.

Results are good so far; we managed to calculate shadows for 400 lights using one shadow ray per pixel as discussed, but we can do better. The per-pixel reservoirs for the current frame contain an “important” light for the pixel, but so do the reservoirs from the previous frame, so why not combine them?

Temporal reuse is pretty straightforward, all we need to do is keep around the reservoirs from the previous frame and combine them using the UpdateReservoir() method above.

    Reservoir temporalReservoir;
    InitialiseReservoir(temporalReservoir);
    
    //reproject using the motion vectors.
    int2 screenPosPrevious = (uv - velocityBuffer[screenPos]) * RTSize.xy;
        
    float3 normalPrevious = normalize(normalBufferPrevious[screenPosPrevious].xyz);
        
    Reservoir reservoirPrevious = reservoirBufferPrevious[screenPosPrevious.y * RTSize.x + screenPosPrevious.x];
     
    //restrict influence from past samples.
    reservoirPrevious.M = min(20.f * reservoir.M, reservoirPrevious.M);
    
    //some simple rejection based on normals' divergence, can be improved
    bool validHistory = dot(normalPrevious, surfaceData.Normal) >= 0.99;
    
    if (validHistory)
    {
        //add current reservoir sample
        UpdateReservoir(temporalReservoir, GetReservoirLightIndex(reservoir), p_hat * reservoir.W_y * reservoir.M, reservoir.M, rngState);
     
        float p_HatPrev = validHistory ?
                          length(GetPointLightRadiance(PointLights[GetReservoirLightIndex(reservoirPrevious)], worldPos, CameraPos.xyz, surfaceData)) :
                          0.0;
        //add sample from previous frame
        UpdateReservoir(temporalReservoir, GetReservoirLightIndex(reservoirPrevious), p_HatPrev * reservoirPrevious.W_y * reservoirPrevious.M, reservoirPrevious.M, rngState);
      
        p_hat = IsReservoirValid(temporalReservoir) ?
                length(GetPointLightRadiance(LightsBuffer[GetReservoirLightIndex(temporalReservoir)], worldPos, CameraPos.xyz, surfaceData)) : 0.0;
      
        //calculate weight of the selected lights                
        temporalReservoir.W_y = p_hat > 0.0 ? rcp(p_hat) * temporalReservoir.W_sum / temporalReservoir.M : 0.0;
    
        reservoir = temporalReservoir;
    }

You will need to add reprojection and some sort of rejection in case the previous sample is not valid (due to disocclusion etc).

Effectively what this does is give the current pixel another chance to select an “important” light, in case it failed the first time around due to the reasons we discussed. This is expected to improve the quality of the result and it does:

We can take the idea of sample reuse even further. It is reasonable to assume that in the neighbourhood of a pixel the surface material properties will be similar, so the “important” lights selected in the neighbouring reservoirs will in all likelihood be suitable for the current pixel as well. We can randomly select a few reservoirs in the vicinity of a pixel and combine them similarly to how we combined the temporal reservoir. How many reservoirs and the radius of the area are configurable; in this case I tried 5 samples (plus the central one) and a radius of 30 pixels, as suggested in the original paper.

    // combine current pixel's reservoir
    float p_hat = IsReservoirValid(reservoir) ?
                 length(GetPointLightRadiance(LightsBuffer[ GetReservoirLightIndex(reservoir) ], worldPos, CameraPos.xyz, surfaceData)) : 0;
             
    UpdateReservoir(reservoirNew, GetReservoirLightIndex(reservoir), p_hat * reservoir.W_y * reservoir.M, reservoir.M, rngState);
       
    for (int i = 0; i < noofNeighbours; i++)
    { 
        float2 offset = 2.0 * float2(rand01(rngState), rand01(rngState)) - 1;
    
        offset.x = screenPos.x + int(offset.x * radius);
        offset.y = screenPos.y + int(offset.y * radius);

        offset.x = max(0, min(RTSize.x - 1, offset.x));
        offset.y = max(0, min(RTSize.y - 1, offset.y));

        float neighbourDepthLinear = LineariseDepth(depthBuffer[int2(offset)].x);
        
        if (  (neighbourDepthLinear > 1.1f * depthLinear || neighbourDepthLinear < 0.9f * depthLinear)   ||
              dot(surfaceData.Normal.xyz, normalBuffer[int2(offset)].xyz) < 0.906)
        {
              // skip this neighbour sample if not suitable
              continue;
        }
    
        Reservoir neighbourReservoir = reservoirBuffer[offset.y * RTSize.x + offset.x];
    
        p_hat = IsReservoirValid(neighbourReservoir) ?
                 length(GetPointLightRadiance(LightsBuffer[ GetReservoirLightIndex(neighbourReservoir) ], worldPos, CameraPos.xyz, surfaceData)) : 0;
         
        UpdateReservoir(reservoirNew, GetReservoirLightIndex(neighbourReservoir), p_hat * neighbourReservoir.W_y * neighbourReservoir.M, neighbourReservoir.M, rngState);
    }  
    
    radiance = IsReservoirValid(reservoirNew) ? GetPointLightRadiance(LightsBuffer[ GetReservoirLightIndex(reservoirNew) ], worldPos, CameraPos.xyz, surfaceData, diffuse, specular)  : 0;
    
    p_hat = length(radiance);
    
    reservoirNew.W_y = p_hat > 0.0 ? rcp(p_hat) * reservoirNew.W_sum / reservoirNew.M  : 0.0;  
    
    reservoir = reservoirNew;
    
    //apply weight to both specular and diffuse
    diffuse *= reservoir.W_y;
    specular *=  reservoir.W_y;
   
    //Find visibility for the selected light
    RayDesc ray;
    ray.Origin = worldPos.xyz;
    ray.TMin = 0.05;
    
    PointLightData lightData = PointLights[GetReservoirLightIndex( reservoir )];
    
    ray.TMax = length(lightData.Position.xyz - worldPos.xyz);
    ray.Direction = normalize(lightData.Position.xyz - worldPos.xyz);

    bool visible = FindHit(Scene, ray);
    
    diffuse *= visible;
    specular *=  visible;

The reservoir combining logic is exactly the same as in the temporal reuse case. We only keep samples that are similar to the current one in terms of both normal direction and depth. This is the output of the spatial reuse only; again the visual improvement and noise reduction is evident.

You might have noticed that at the end of the code we recalculate visibility for the selected light. Lights in the reservoirs already have visibility applied to them from the initial gather pass (their weight W_y becomes zero if they are occluded, so they don’t contribute), but during spatial reuse surface positions change and the previously calculated visibility might not be valid any more. In the next screenshot I calculate the final radiance without visibility as an example.

Shadows are broadly correct but not as accurate as in the previous case; some light leaking is noticeable (check the base of the pillar on the right for example).

Finally, we can combine both temporal and spatial reuse and get good quality with much reduced noise.

In this case we feed the output of the spatial reuse back into the temporal reuse pass, something that will continue to improve the quality of the selected lights over time. ReSTIR won’t remove noise entirely, it will need additional denoising for that, but we start at a very good place with little variance in the image. As a reminder, this is what we started with:

Regular TAA at the end can further suppress some of the remaining noise to produce a cleaner image.

With this article I made an attempt to demonstrate that it is possible to set up and get basic ReSTIR working without first delving too deep into the Maths behind it, to encourage and pique people’s interest in this great technique. I need to stress though that to implement it correctly and get correct results one will need a good understanding of the supporting Maths. I would also suggest that you start with the original paper and the pseudocode in it, as it is all you will need to get it working, and then study The Gentle Introduction to ReSTIR, as it does a great job at explaining the foundations and many improvements. There is still so much to explore with ReSTIR, support for area lights and emissive surfaces for example, and how it can be used for GI.

Finally, I have only just started exploring ReSTIR myself, so if anything in this post is not accurate please let me know!


Raytraced Order Independent Transparency part 2

In the previous blog post I discussed how raytracing can be used to achieve order independent transparency (OIT) for some types of transparencies and how it compares to other OIT methods like per pixel linked lists and Multi-layer Alpha blending (MLAB). The basic idea, since DXR doesn’t support distance sorted traversal of the BVH, was to use a closest hit shader to find the intersection closest to the camera and then use the position of that intersection as the origin of a new ray to trace through the BVH. That worked well in that it achieved OIT, but the fact that each ray has to traverse the TLAS from the top every time we find an intersection is not ideal.


Raytraced Order Independent Transparency

About a year ago I reviewed a number of Order Independent Transparency (OIT) techniques (part 1, part 2, part 3), each achieving a different combination of performance, quality and memory requirements. None of them fully solved OIT though and I ended the series wondering what raytraced transparency would look like. Recently I added (some) DXR support to the toy engine and I was curious to see how it would work, so I did a quick implementation.


Experimenting with fp16, part 2

In the previous blog post I discussed how enabling fp16 for a particular shader didn’t seem to make a performance difference and also forced the compiler to allocate a larger number of VGPRs compared to the fp32 version (108 vs 81), which seemed weird as one of the (expected) advantages of fp16 is reduced register allocation. So I spent some more time investigating why this is happening. The shader I am referring to is the ResolveTemporal.hlsl one from the FidelityFX SSSR sample I recently integrated to my toy renderer.


Experimenting with fp16 in shaders

With recent GPUs and shader models there is good support for 16 bit floating point numbers and operations in shaders. On paper, the main advantages of the fp16 representation are that it allows packing two 16 bit numbers into a single 32 bit register, reducing the register allocation for a shader/increasing occupancy, and also allows a reduction of the ALU instruction count by performing instructions on packed 32 bit registers directly (i.e. affecting the two packed fp16 numbers independently). I spent some time investigating what fp16 looks like at the ISA level (GCN 5) and am sharing some notes I took.

I started with a very simple compute shader implementing some fp16 maths as a test. I compiled it using the 6.2 shading model and the -enable-16bit-types DXC command line argument.


Stream compaction using wave intrinsics

It is common knowledge that removing unnecessary work is a crucial mechanism for achieving good performance on the GPU. We routinely create lists of visible model instances, for example using frustum and other means of culling, to avoid rendering geometry that will not contribute to the final image. While it is easy to create such lists on the CPU, it may not be as trivial for work generated on the GPU, for example when using GPU driven culling/rendering, or when deciding which pixels in the image to raytrace reflections for. Such operations typically produce lists with invalid (culled) work items, which is not a very effective way to make use of a GPU’s batch processing nature, either having to skip over shader code or introducing idle (inactive) threads in a wave.


Notes on screenspace reflections with FidelityFX SSSR

Today I set out to replace the old SSR implementation in the toy engine with AMD’s FidelityFX’s one but in the end I got distracted and spent the day studying how it works instead. This is a modern SSR solution that implements a lot of good practices so I’ve gathered my notes in a blog post in case someone finds it of interest. This is not intended as an exhaustive description of the technique, more like a few interesting observations.

The technique takes as input the main rendertarget, the worldspace normal buffer, a roughness buffer, a hierarchical depth buffer and an environment cubemap. The hierarchical depth buffer is a mip chain where each mip level pixel is the minimum of the previous level’s 2×2 area depths (mip 0 corresponds to the screen-sized, original depth buffer). It will be used later to speed up raymarching but can also be used in many other techniques, like GPU occlusion culling.


Order Independent Transparency: Endgame

In the past 2 posts (part 1, part 2), I discussed the complexity of correctly sorting and rendering transparent surfaces and went through a few OIT options, including per pixel linked lists, transmittance function approximations and the role rasteriser ordered views can play in all this. In this last post I will continue and wrap up my OIT exploration, discussing a couple more transmittance function approximations that can be used to implement improved transparency rendering.


Order independent transparency, part 2

In the previous blog post we discussed how to use a per-pixel linked list (PPLL) to implement order independent transparency and how the unbounded nature of overlapping transparent surfaces can be problematic in terms of memory requirements, and ultimately may lead to rendering artifacts. In this blog post we explore approximations that are bounded in terms of memory.

Also in the previous blog post we discussed the transmittance function

T(z_i) = \prod_{k=0}^{i}(1 - a_k)

and how it can be used to describe how radiance is reduced as it travels through transparent surfaces

\sum_{i=0}^{N-1} c_i a_i T(z_{i-1}) + T(z_{N-1}) R
