Shader tips and tricks

In this post I have collected some random shader tips and tricks for easy access. Most of them revolve around improving performance, as one can imagine, with a slight bias towards GCN/RDNA and DirectX. Before I begin with the list of advice, a few caveats.

The caveats

  1. Take any performance advice with a grain of salt; a lot of it falls into the “best practice” category, but it is not gospel.
  2. Always profile before and after an “improvement”; sometimes the observed results defy common knowledge.
  3. Get to know the target platform and its particularities; eg a laptop GPU can be quite powerful but lack memory bandwidth compared to a desktop one.

The advice

  1. If there is no need for texture filtering, prefer reading textures/rendertargets with Load() over Sample() with point sampling.
  2. The texture units support some comparison operations when sampling texture data (eg min/max); it is worth using them for increased speed and reduced VGPR requirements.
  3. On the subject of using the fixed function units more, it is worth creating a texture as sRGB to avoid the linearisation in the shader.
  4. When sampling a single channel, use Gather4() (GatherRed() etc in HLSL) instead of Sample(); it won’t reduce memory bandwidth but it will reduce the traffic to the texture units, improve the cache hit rate and lower the VGPRs needed for addressing (see the sketch after this list).
  5. Use input and output instruction modifiers. Modifiers like saturate(), x2, x4, /2, /4 on output and abs(), negate on input can allow extra operations in shaders for free. Eg this: saturate(4*(x*-abs(z)-abs(y))) is just a single instruction on GCN: v_fma_f32 v1, v2, -abs(v3), -abs(v0) mul:4 clamp.
  6. You can also use some literals like -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0 and 1.0/(2.0*pi) directly in an instruction (no register allocation, again on GCN).
  7. Consider partially unrolling loops when reading from textures (eg halve the number of iterations and do 2 texture reads in each) to improve scheduling.
  8. Doing the opposite can help reduce the VGPR count in the shader and increase occupancy.
  9. Fully unrolling loops can increase the shader instruction count and reduce shader residency in instruction caches.
  10. Linear compute shader thread indexing can lead to bad cache coherence when the indices are used as texture coordinates. Swizzling the thread id can improve texture sample locality and increase cache hits (see the swizzle sketch after this list).
  11. Select a buffer type (Constant, Structured, ByteAddress etc) appropriate for a particular GPU; for example, Intel GPUs can benefit from a ByteAddressBuffer.
  12. Select input and output data types appropriate for your application (eg R11G11B10 floating point rendertargets instead of RGBA16); this can reduce bandwidth and potentially VGPR use as well.
  13. It is worth converting and/or packing larger data types into smaller ones (lower precision); ALU decompression is cheaper than memory bandwidth (see the packing sketch after this list).
  14. fp16 is a good way to reduce VGPR usage and speed up ALU ops, and it is a first class citizen in newer GPUs. Actual fp16 performance will rely on finding 2x data parallelism within a single thread and staying within the fp16 instruction set for long periods of time. Also keep an eye on automatic fp16 to fp32 promotions.
  15. Context rolls can matter, especially with small drawcalls; try to batch drawcalls by state.
  16. Use scalar and wave operations where possible, for faster instructions and data access. Look out for gotchas like SV_InstanceID, which is not wavefront invariant.
  17. Even if an operation can’t be fully scalarised, you can use a waterfall loop to partially scalarise it (sketched after this list).
  18. Memory bandwidth is often the biggest bottleneck in a shader; avoid cache misses, especially when sampling with noise, and consider de-interleaving to achieve good locality. This can help with thread-divergent work like raytracing/raymarching.
  19. Use NonUniformResourceIndex when indexing arrays of textures with a potentially divergent index, even if you create them in code.
  20. Store frequently accessed data as constants straight in the root signature, but don’t overdo it: the root signature has limited space and may spill to main memory.
  21. Group maths operations by data type. Mixing scalar and vector types in operations, instead of grouping them by dimension, can lead to wasted ALU instructions in a shader; reordering a series of multiplications, for example, can drop the number of instructions by 25% (see the example after this list).
  22. Do not fear branches in the shader but reason about their use (how divergent they can be; uniform is fine, non-uniform maybe not). Leaving big, rarely taken branches in the code will bring the total VGPR allocation up and can affect occupancy.
  23. Small branches of code may perform better when “flattened” (eg with the [flatten] attribute), ie executing both sides and selecting the result.
  24. Consider using “tiled” processing (classify pixels into tiles, say 16×16, and run a different shader for similar tiles) if there is a lot of divergence in the shader.
  25. Both the VGPR count and the Local Data Share (LDS) allocation can affect occupancy; it is worth keeping an eye on both.
  26. Low occupancy is not always bad; the compiler may have other means to hide memory latency. An indication of this ability is the distance between a texture fetch being issued and its result being used (look for the s_waitcnt instruction).
  27. High occupancy/low VGPR count is not always good: the compiler may be forced to “serialise” memory fetches to reuse VGPRs more, which can lead to bad scheduling, and it can also thrash the cache.
  28. Do not do (expensive) work in threads that fall outside the rendertarget/screen; early-out instead (sketched after this list).
  29. Transfer work from pixel shaders to compute shaders to remove potential export stalls (especially if the pixel shader is short). Similarly, this could benefit shader code with a lot of early-outs and varying workloads.
  30. Beware of early-z deactivation (writing to SV_Depth or a UAV, alpha testing, alpha to coverage etc); it can lead to wasted pixel shader work.
  31. Stencil culling is generally faster than discarding the pixel in the pixel shader.
  32. Use bit twiddling hacks; a lot of them are usable in shaders as well.
  33. A compute shader threadgroup should fill at least a few wavefronts/warps, eg 128/256 threads on GCN. The number also depends on the registers used per thread, to achieve good occupancy, and on whether the threads of the group need to share data or not.
  34. Avoid non-native instructions like atan; they expand into a large number of native instructions.
  35. Avoid integer division; it is a non-native operation. Keep an eye on the D3D12 debug layer, which will warn about its presence.
  36. Use approximations of expensive (eg trigonometric) functions, but profile to determine whether they are actually an improvement.
  37. InverseLerp is a useful little function to get the fraction based on a range and a distance. I frequently use it as a cheap replacement for smoothstep (see the snippet after this list).
  38. It’s worth storing indices/loop counts you pass to shaders as ints instead of floats; it’ll save you some conversion instructions and maybe registers. Make sure that you do it on both ends and not only in the shader.
  39. When using SV_DispatchThreadID in a compute shader to emulate the pixel shader’s SV_Position, it is worth adding half a pixel to it, else you may get errors (like a world position reconstruction mismatch when doing TAA). SV_Position in the pixel shader comes with the half pixel already added (see the sketch after this list).
  40. Packoffsets in constant buffers do not have to be consecutive or in order, or even start at zero. Prepare for some debugging pain if your shader reflection data doesn’t capture this.
  41. Bank conflicts when reading LDS can increase latency of memory read instructions as it serialises them.
  42. Setting any vertex position to NaN is a good way to cull a triangle in the Vertex Shader. Depending on the GPU, this can be achieved by writing 0/0 or asfloat(0x7fc00000) (sketched after this list).
  43. In HLSL, asfloat(0x7F800000) can also be used to represent “Infinity”, for example for setting the far plane.
  44. GetDimensions() is considered a texture instruction; it is better to pass texture dimensions through a constant buffer.
  45. Consider moving pixel shader work to the vertex shader, but bear in mind that passing data from the vertex to the pixel shader can become a bottleneck if there is too much of it, that you will always pay for the work even if the pixel is culled/discarded, and that on some architectures texture access can be slower in the vertex shader due to reduced cache locality.
  46. NaNs can propagate through the pipeline and destroy the output. Catch NaNs in the shader and visualise them for easier debugging (see the snippet after this list).
  47. When debugging a shader feature try to reduce the problem dimensionality by, for example, providing known fixed values as an input, as opposed to a texture.
  48. Visual shader debugging (making the shader output a constant value, eg red, for each different path) is sometimes the fastest way to determine what is wrong with it.
  49. Vertex attribute interpolation happens in the shader on modern GPUs; it is worth using the “nointerpolation” modifier when no interpolation is actually needed, to reduce ALU and VGPR usage (see the example after this list).
  50. On some architectures, binding the depth buffer as a texture will decompress it, making subsequent z-testing more expensive. Make sure that you are done using it for geometry rendering before using it as a texture (not so easy in deferred shading engines, where you need it for lighting before rendering transparent meshes).
  51. Speaking of z-testing, reverse-z is easy to set up, increases depth precision in the depth buffer and significantly reduces z-fighting.
  52. It pays off to invest time in learning how to read shader assembly and get in the habit of inspecting the output of the shader compiler to determine the potential cost of some instructions, using tools like Shader Playground.
  53. To achieve good performance one needs to utilise the GPU fully, both the ALU and the fixed-function units. This involves profiling each case to identify bottlenecks, and often going against common wisdom, making a shader less efficient if that means it can overlap better with other work. This presentation is a good chronicle of this.
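A few HLSL sketches for some of the tips above follow. They are minimal illustrations under stated assumptions rather than drop-in implementations, and all resource and function names in them are made up.

For tip 4, fetching the whole 2×2 footprint of a single-channel texture with one GatherRed() instead of four separate reads:

```hlsl
Texture2D<float> depthTexture;  // hypothetical single-channel texture
SamplerState pointSampler;      // filtering mode is irrelevant to Gather

// One texture instruction returns all four texels of the 2x2 footprint
// around uv, instead of four Sample()/Load() calls.
float4 GatherDepth2x2(float2 uv)
{
    return depthTexture.GatherRed(pointSampler, uv);
}
```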
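For tip 10, one possible thread id swizzle: remapping a linear index into 8×8 tiles so that consecutive threads touch neighbouring texels (the tile size and the helper are assumptions, profile against your access pattern):

```hlsl
// widthInTiles is the target width divided by the tile size (8 here).
uint2 SwizzleTo8x8(uint linearIndex, uint widthInTiles)
{
    uint tile        = linearIndex / 64;  // which 8x8 tile
    uint withinTile  = linearIndex % 64;  // position inside the tile
    uint2 tileCoord  = uint2(tile % widthInTiles, tile / widthInTiles);
    uint2 texelCoord = uint2(withinTile % 8, withinTile / 8);
    return tileCoord * 8 + texelCoord;
}
```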
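For tip 13, packing two fp32 values into a single uint as fp16 halves with the f32tof16()/f16tof32() intrinsics, halving the storage at the cost of a little ALU (and of fp16 precision, which may not suit all data):

```hlsl
// f32tof16() returns the fp16 bit pattern in the low 16 bits of a uint.
uint PackHalf2(float2 v)
{
    return f32tof16(v.x) | (f32tof16(v.y) << 16);
}

float2 UnpackHalf2(uint p)
{
    return float2(f16tof32(p & 0xffff), f16tof32(p >> 16));
}
```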
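For tip 17, a sketch of a waterfall loop (Shader Model 6 wave intrinsics; the material buffer and the divergent index are hypothetical). Each iteration broadcasts one scalar value and retires the lanes that share it, so the indexed access inside is uniform:

```hlsl
StructuredBuffer<float4> MaterialColours;  // hypothetical material data

float4 FetchMaterialColour(uint materialIndex)  // potentially divergent
{
    float4 colour = 0;
    for (;;)
    {
        // The first active lane broadcasts its index: a scalar value.
        uint uniformIndex = WaveReadLaneFirst(materialIndex);

        // Lanes that share it do the (now uniform) fetch and exit; the
        // rest loop again with fewer active lanes, so the loop terminates.
        if (uniformIndex == materialIndex)
        {
            colour = MaterialColours[uniformIndex];
            break;
        }
    }
    return colour;
}
```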
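For tip 21, the multiplication-reordering idea with made-up names: grouping the scalar factors first scales the vector once instead of three times:

```hlsl
// Evaluated left to right: three float3-by-scalar multiplies, 9 ALU ops.
float3 ScaleSlow(float3 colour, float intensity, float exposure, float fade)
{
    return colour * intensity * exposure * fade;
}

// Two scalar multiplies plus one float3-by-scalar multiply, 5 ALU ops.
float3 ScaleFast(float3 colour, float intensity, float exposure, float fade)
{
    return colour * (intensity * exposure * fade);
}
```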
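For tip 28, the usual early-out at the top of a compute shader when the dispatch overhangs the target (the constant buffer layout is assumed):

```hlsl
cbuffer Constants { uint2 outputSize; };  // assumed to hold the target size
RWTexture2D<float4> outputTexture;

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Threadgroups overhang the target when its size is not a multiple
    // of the group size; skip the expensive work in those threads.
    if (any(id.xy >= outputSize))
        return;

    // ... expensive work ...
    outputTexture[id.xy] = float4(0, 0, 0, 1);
}
```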
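For tip 37, one way to write InverseLerp; the saturate() is a choice that lets it stand in for smoothstep when the smooth falloff is not needed:

```hlsl
// Returns how far x is along the range [a, b], clamped to [0, 1].
float InverseLerp(float a, float b, float x)
{
    return saturate((x - a) / (b - a));
}
```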
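For tip 39, adding the half-pixel offset that SV_Position would have had (the constant buffer is assumed to hold the reciprocal of the target size):

```hlsl
cbuffer CSConstants { float2 invResolution; };  // assumed: 1.0 / target size

[numthreads(8, 8, 1)]
void FullscreenCS(uint3 id : SV_DispatchThreadID)
{
    // SV_Position in a pixel shader sits at the pixel centre; emulate
    // that by adding half a pixel before computing uvs.
    float2 uv = (id.xy + 0.5) * invResolution;
    // ... sample/reconstruct with uv ...
}
```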
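For tip 42, culling in the vertex shader by writing a quiet NaN; the explicit bit pattern is safer than computing 0/0, which a compiler may fold away (the condition and output struct are hypothetical):

```hlsl
if (cullThisVertex)  // hypothetical per-vertex culling condition
{
    float nan = asfloat(0x7fc00000);  // quiet NaN bit pattern
    output.position = float4(nan, nan, nan, nan);
}
```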
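For tip 46, a debug view that paints NaN pixels bright red (the lighting function is hypothetical; note that fast-math style compile flags can optimise isnan() away, so verify it survives compilation):

```hlsl
float3 colour = ComputeLighting(input);  // hypothetical lighting code
if (any(isnan(colour)))
    colour = float3(1, 0, 0);  // NaNs show up as pure red
```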
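For tip 49, flagging an attribute that is constant across the triangle (the struct is illustrative):

```hlsl
struct VSOutput
{
    float4 position : SV_Position;
    float2 uv       : TEXCOORD0;  // interpolated as usual
    // Constant across the triangle: skips the per-pixel interpolation ALU.
    nointerpolation uint materialId : MATERIALID;
};
```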

Occlusion and directionality in image based lighting: implementation details

I got a few follow-up questions on the blog post I published a few days ago on occlusion and directionality in image based lighting, so I put together a quick follow-up to elaborate on a few points and add some more resources.

To implement the main technique the exploration was based on, Ground Truth ambient occlusion (GTAO), it is worth starting with the original paper by Jimenez et al. This is best read in conjunction with the Siggraph 2016 presentation, which will help in understanding the paper better. The paper also includes fairly detailed pseudo-code for the GTAO and bent normals implementation; it helps to also use Intel’s implementation of the technique as a reference, as it clarifies some parts as well. At the moment the sample does not seem to implement directional GTAO, in which the visibility cone is combined with the cosine lobe and projected to SH.

Continue reading “Occlusion and directionality in image based lighting: implementation details”

Notes on occlusion and directionality in image based lighting.

Update: I wrote a follow-up post with some implementation details and some more resources here.

Image based lighting (IBL), in which we use a cubemap to represent indirect radiance from an environment, is an important component of scene lighting. Environment lighting is not always uniform and often has a strong directional component (think of a sunset or sunrise, or a room lit by a window), and that indirect light should interact with the scene correctly, with directional occlusion. I spent some time exploring the directionality and occlusion aspects of IBL for diffuse lighting, with a sprinkle of raytracing, and made some notes (and pretty pictures).

Continue reading “Notes on occlusion and directionality in image based lighting.”

Shaded vertex reuse on modern GPUs

A well known feature of GPUs is the post-transform vertex cache: when a drawcall uses an index buffer to index the vertices to be processed, the output of the vertex shader for each vertex is cached. If the same vertex is subsequently indexed, as part of another triangle, the results are already in the cache and the GPU need not process that particular vertex again. Since all caches are of limited capacity, rendering engines typically rearrange the vertex indices of meshes to encourage more locality in vertex reuse and a better cache hit ratio.

Continue reading “Shaded vertex reuse on modern GPUs”

The curious case of slow raytracing on a high end GPU

I’ve typically been doing my compute shader based raytracing experiments with my toy engine on my ancient laptop, which features an Intel HD4000 GPU. That GPU is mostly good for proving that the techniques work and for getting some pretty screenshots, but the performance is far from real-time, with 1 ray-per-pixel GI for the following scene costing around 129 ms when rendering at 1280×720 (plugged in).

Continue reading “The curious case of slow raytracing on a high end GPU”

Book review: 3D Graphics Rendering Cookbook

I was recently invited to review the new 3D Graphics Rendering Cookbook by Sergey Kosarevsky and Viktor Latypov. The main focus of the book is the implementation of a large variety of graphics techniques using both modern OpenGL and Vulkan, an interesting approach that can show the parallels between the two graphics APIs and act as a stepping stone for less experienced programmers towards a better understanding of Vulkan.

Continue reading “Book review: 3D Graphics Rendering Cookbook”

Raytracing tidbits

Over the past few months I did some smaller scale raytracing experiments, which I shared on Twitter but never documented properly. I am collecting them all in this post for ease of access.

On ray divergence

Raytracing has the potential to introduce large divergence in a wave. Imagine a thread whose shadow ray, shooting towards the light, hits a triangle and “stops” traversal, while the one next to it misses and has to continue traversing the BVH. Even a single long ray/thread has the potential to hold up the rest of the threads (63 on GCN, 31 on NVidia/RDNA) and prevent the whole wave from retiring and freeing up resources.

Continue reading “Raytracing tidbits”

Experiments in Hybrid Raytraced Shadows

A few weeks ago I implemented a simple shadowmapping solution in the toy engine to try as a replacement for shadow rays during GI raytracing. Having the two solutions (shadowmapping and RT shadows) side by side, along with some offline discussions I had, made me start thinking about how it would be possible to combine the two into a hybrid raytraced shadows solution, like I did with hybrid raytraced reflections in the past. This blog post documents a few quick experiments I did to explore this idea a bit.

Continue reading “Experiments in Hybrid Raytraced Shadows”

How to read shader assembly

When I started graphics programming, shading languages like HLSL and GLSL were not yet popular in game development and shaders were developed straight in assembly. When HLSL was introduced, I remember us trying, for fun, to beat the compiler by producing shorter and more compact assembly code by hand, something that wasn’t that hard. Since then, shader compiler technology has progressed immensely and nowadays, in most cases, it is pretty hard to produce better assembly code by hand (also, shaders have become so large and complicated that it is not cost effective any more anyway).

Continue reading “How to read shader assembly”

RDNA 2 hardware raytracing

Reading through the recently released RDNA 2 Instruction Set Architecture Reference Guide, I came across some interesting information about raytracing support in the new GPU architecture. Disclaimer: the document is a little light on specifics, so some of the following is extrapolation and may not be accurate.

According to the released diagram of the new RDNA 2 Workgroup Processor (WGP), a new hardware unit, the Ray Accelerator, has been added to implement ray/box and ray/triangle intersection in hardware.

Continue reading “RDNA 2 hardware raytracing”