Ways to speed up pixel shader execution

Catching up on my Twitter DMs, I came across a question about ways to increase the execution speed of pixel/fragment shaders. This is quite a broad issue, and the specifics will depend on the particularities of each GPU/platform and the game content, but I am expanding on my “brain-dump” style answer in this post in case others find it useful. This is not a comprehensive list, more like a list of high level pointers to get one started.

A rendering engine’s performance should be viewed holistically, in a top-down fashion, using tools like Nsight and PIX or whatever your target platform supports, to identify where the performance bottleneck is before focusing on a specific area like the pixel shader cost. The GPU is a pipeline with various fixed and programmable units, and each can have a different impact on performance based on the content we push to it, e.g. the number and size of triangles, the complexity of the implemented graphics techniques (lighting model, SSAO, SSR quality), and the rendertarget and texture resolution. Having said that, a large chunk of GPU time in a frame is spent on pixel shaders, so in all likelihood you’ll spend a lot of time optimising them.

As in most optimisation efforts, the performance of a pixel shader can be improved in two ways: by not running it at all if possible and, if not, by making it cheaper to run.

So a good first step would be to avoid rendering occluded triangles at all, using a CPU or GPU based solution. Out-of-frustum triangles should definitely be culled as well, but they don’t typically reach the pixel shader stage.

The GPU also tries to help us with the former (avoid running the pixel shader), looking for opportunities to prevent the pixel shader from being executed, either at a coarse level using HiZ (or ZCULL on Nvidia GPUs), or by testing the fragment’s depth against the depth buffer before running the pixel shader (early Z). We can help it be more effective at that as well:

  • Obvious but worth stating, make sure to use a depth-buffer and a z-test and to draw (solid) meshes front to back to avoid overdraw.
  • Make sure that we are not drawing very small triangles (less than 2×2 pixels, ideally more; the larger the screen coverage, the better HiZ will be able to cull whole tiles of occluded pixels).
  • In some cases, especially when drawing content like dense foliage, a z-prepass to populate the depth buffer before doing a g-buffer or a forward pass will help cull pixels and reduce overdraw.
  • Performing some operations in the pixel shader, such as clipping (clip/discard) and depth writing (SV_Depth) when depth writes are on, as well as writing to an Unordered Access View (UAV) buffer, can prevent the GPU from using HiZ or early Z, as it won’t know whether the pixel shader will write to the depth buffer until after it has executed it. There are ways to bypass this behaviour by using the [earlydepthstencil] attribute and conservative depth output in the shader.
  • Using the stencil buffer may help as well: populating it with a very cheap draw that only writes to it lets later passes skip pixel shader invocation entirely where the stencil test fails (again, profile to determine actual benefit, YMMV).
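
To illustrate why draw order matters when a depth test runs before shading, here is a toy sketch in Python rather than real GPU code. It assumes a one-dimensional "screen", a less-than depth test, and a hypothetical set of three opaque spans; the function name `count_shaded_pixels` and the scene data are made up for the example. It only counts how many times the pixel shader would run, and it ignores HiZ tiles and everything else a real GPU does.

```python
def count_shaded_pixels(draws, width):
    """Count pixel shader invocations with a less-than depth test done before shading.

    draws: list of (x_start, x_end, depth) opaque spans, rendered in list order.
    """
    depth_buffer = [float("inf")] * width
    shaded = 0
    for x_start, x_end, depth in draws:
        for x in range(x_start, x_end):
            if depth < depth_buffer[x]:  # early Z: test before shading
                depth_buffer[x] = depth
                shaded += 1              # the "pixel shader" runs only here
    return shaded


# Hypothetical scene on a 10 pixel wide screen, listed farthest first.
spans = [(0, 8, 5.0), (2, 10, 3.0), (0, 10, 1.0)]

back_to_front = count_shaded_pixels(spans, 10)
front_to_back = count_shaded_pixels(sorted(spans, key=lambda s: s[2]), 10)

print(back_to_front, front_to_back)
```

In this made-up scene the nearest span covers the whole screen, so drawing it first lets the depth test reject every later pixel, while the back-to-front order shades every span in full.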

When a pixel can’t be culled and has to be shaded, you can try to reduce the execution cost/impact.

  • A memory read (textures, rendertargets, data buffers) is the most expensive operation that you can perform in a shader, especially if the GPU has to reach outside the cache to the main memory. Try to use as few texture reads as possible, and use smaller data formats (e.g. try R11G11B10 floating point rendertargets instead of RGBA16) wherever possible. Also make sure that the textures are compressed and mipmapping is activated.
  • Nvidia GPUs favour constant buffers over structured buffers, so they are worth considering when your data can fit in one.
  • On some architectures, binding the depth buffer as a texture will decompress it, making subsequent z-testing more expensive. Make sure that you are done using it for mesh rendering before using it as a texture (not so easy in deferred shading engines where you need it for lighting, before rendering transparent meshes).
  • On some GPUs, bilinearly sampling from large data type textures/rendertargets (e.g. > 32bpp) is not full rate (can take many clocks per instruction to complete). Anisotropic filtering can also be expensive, so make sure that the visual improvement is worth the cost.
  • Reduce the number of vector registers used in the shader. Allocating a large number of registers reduces the ability of the GPU to parallelise work on the compute units (A more technical explanation is that the GPU will stall a warp/wavefront on a memory instruction until the requested data are fetched from/written to memory. When that happens the GPU will try to execute another instruction from a different warp/wavefront but for this to work other warps/wavefronts must be available in the queue to be used. Using a lot of registers limits its ability to schedule this).
  • On a related note, pay attention to unbalanced if-branches, as the compiler will allocate registers for the largest branch even if it is not taken very often, or at all. In such cases it may be worth splitting the if-paths into two separate shaders.
  • If you are not doing many texture reads but the shader is still slow, then try doing fewer ALU operations in the shader. In some cases (a large number of ALU operations and very few texture operations in the shader) it might be worth precalculating data and storing it in buffers, especially if you are doing a lot of expensive maths operations. Take this with a pinch of salt and profile though, as ALU operations are still much faster than texture operations.
  • Some effects, like screen space ambient occlusion and bloom, can be calculated at a lower resolution to reduce their cost.
  • Operations that don’t need to be done per pixel (or their output can be interpolated) could be done in the vertex shader and the result passed to the pixel shader. Be careful not to overdo it though, because a large number of interpolants can have a performance impact.
  • It may be worth converting a pixel shader that operates in 2D (screen space pass) to a compute shader, especially if it does not rely on Render Output Unit (ROP) operations like blending. This may also speed up the shader if it is small, as the ROP unit can become a bottleneck in such a case. Again, this needs profiling on your target platform to determine the benefit; techniques that can use the Local Data Share (groupshared memory) to cache calculations, like blurring, may benefit more.
  • If doing alpha blending (including particles), still use a z-test to cull against solid geometry. Then try to reduce the overdraw by overlapping transparent meshes less, and the overdraw cost by making the pixel shader cheaper and/or reducing the size of transparent meshes on screen.
  • Get in the habit of inspecting the output of the shader compiler to determine the potential cost of some instructions, using tools like Shader Playground. Innocent-looking instructions like “atan” can expand into many shader instructions. It is also worth reading Emil Persson’s presentations on programming GPUs efficiently.
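
As a rough back-of-the-envelope illustration of the data format advice in the list above, the sketch below compares the raw size of one full-screen rendertarget at 32 bits per pixel (R11G11B10) and 64 bits per pixel (RGBA16). The 1920×1080 resolution and the helper name `render_target_bytes` are assumptions chosen for the example, and the numbers ignore any hardware compression, tiling or caching, so treat them as an upper bound on traffic per full read or write rather than a measurement.

```python
def render_target_bytes(width, height, bits_per_pixel):
    """Raw size in bytes of one rendertarget, ignoring compression and padding."""
    return width * height * bits_per_pixel // 8


# Assumed resolution for the example: 1920x1080.
r11g11b10 = render_target_bytes(1920, 1080, 32)  # 11+11+10 bits per pixel
rgba16 = render_target_bytes(1920, 1080, 64)     # 4 channels x 16 bits

print(r11g11b10, rgba16, rgba16 / r11g11b10)
```

Every full-screen pass that reads or writes the target moves that many bytes in the worst case, so halving the format size roughly halves the raw traffic for that pass.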

It is also worth studying the architecture, and the particularities, of the GPU you are optimising for. There are a lot of references on GPU architectures online.

Also worth looking for documentation on how compute shaders work, like https://anteru.net/blog/2018/intro-to-compute-shaders/, as they are “closer to the metal” than pixel/vertex shaders and also posts on GPU profiling, such as https://devblogs.nvidia.com/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/, as they often expose how a GPU works.

Finally, it is always a good idea to profile any performance-improving changes that you make in isolation to determine the actual improvement. Some improvements can bring the opposite result: for example, reducing the number of registers to increase occupancy may lead to texture cache thrashing, because more warps/wavefronts may try to bring data from different memory areas. GPU profiling and performance improvement is more of an art than a science.
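
To make the link between register count and occupancy more concrete, here is a minimal sketch of how one might estimate the number of wavefronts that can be resident at once. The default numbers (a budget of 256 vector registers per lane, at most 10 wavefronts per SIMD, allocation in blocks of 4 registers) are assumptions modelled on a GCN-style SIMD; other architectures use different limits, and the function name `waves_per_simd` is made up for the example. It also ignores other limits on occupancy, such as scalar registers and shared memory.

```python
def waves_per_simd(vgprs_per_thread, vgpr_budget=256, max_waves=10, granularity=4):
    """Estimate resident wavefronts per SIMD from vector register usage.

    All defaults are assumptions (GCN-style); check your target GPU's documentation.
    """
    if vgprs_per_thread < 1:
        raise ValueError("a shader uses at least one vector register")
    # Round the register count up to the allocation granularity.
    allocated = -(-vgprs_per_thread // granularity) * granularity
    return min(max_waves, vgpr_budget // allocated)


for vgprs in (24, 25, 64, 84, 128):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "waves")
```

Under these assumptions, a shader that drops from 128 to 64 registers goes from 2 to 4 resident wavefronts, which gives the scheduler more work to switch to while memory requests are in flight. As noted above, whether that actually runs faster still has to be profiled.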
