Inspired by an interesting discussion on Twitter about its use in games, I put together some thoughts on the z-prepass and its use in the rendering pipeline.
To begin with, what is a z-prepass (zed-prepass, as we call it in the UK): in its most basic form it is a rendering pass in which we render large, opaque meshes (a partial z-prepass) or all the opaque meshes (a full z-prepass) in the scene using a vertex shader only, with no pixel shaders or rendertargets bound, to populate the depth buffer (aka z-buffer).
A big advantage of a (full) z-prepass is that, with the depth buffer containing the depths of the opaque surfaces closest to the camera, there is a guarantee that any subsequent geometry rendering drawcall will have zero overdraw, meaning that the (potentially expensive) pixel shader will be executed only once per pixel. Even in the case of a partial z-prepass we can often take advantage of this; for example, rendering the walls of a building only in the z-prepass can prevent the GPU from running the pixel shader for the building’s contents.
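To make the overdraw saving concrete, here is a toy sketch (plain Python, no GPU involved; the fragment depths and submission order are made up) counting pixel shader invocations for a single pixel covered by three opaque surfaces drawn back to front, with and without a prepass-primed depth buffer:

```python
def shaded_count(frag_depths, use_prepass):
    # With a z-prepass the depth buffer already holds the closest depth and
    # the main pass can use an EQUAL depth test; without it, LESS + write.
    depth = min(frag_depths) if use_prepass else float("inf")
    shaded = 0
    for z in frag_depths:  # submission order: back to front (worst case)
        passes = (z == depth) if use_prepass else (z < depth)
        if passes:
            shaded += 1
            if not use_prepass:
                depth = z  # depth write
    return shaded

frags = [9.0, 5.0, 2.0]                               # back to front
no_prepass = shaded_count(frags, use_prepass=False)   # 3 shader runs
with_prepass = shaded_count(frags, use_prepass=True)  # 1 shader run
```

With badly sorted content the pixel shader runs once per covering surface; the prepass caps it at one run regardless of submission order.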
Adding a z-prepass to the rendering pipeline that potentially draws all the meshes in the scene one additional time sounds counter-intuitive, and apart from the extra GPU work it can overburden the CPU as well with the increased drawcall count. Due to the simple nature of the z-prepass though (it only writes depth to the z-buffer), there are opportunities to simplify it and reduce its cost. For example:
- a position-only vertex buffer can be bound to reduce memory bandwidth
- because we are interested only in the position and not normals, uvs etc, the same vertex shader, as well as the same render states, can be used for more drawcalls, increasing batching and reducing the per-drawcall overhead
- in some cases it could be possible to use lower-polycount mesh LODs, provided your LODs are conservative in the sense that the lower-polycount LODs are totally contained in the higher-polycount ones, to avoid z-fighting artifacts. You could also, especially in the case of a partial z-prepass, create specially made proxies, such as simplified meshes for buildings (again paying attention to z-fighting)
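As a toy illustration of the batching point above (hypothetical scene data, plain Python): in the main pass every material change forces a pipeline/state change, while a depth-only pass can key its draws on the vertex data alone, collapsing materials together:

```python
def count_state_changes(draws, state_keys):
    # Sort draws by the state that matters for the pass, then count how many
    # distinct state blocks the GPU has to be reconfigured for.
    ordered = sorted(draws, key=lambda d: tuple(d[k] for k in state_keys))
    changes, prev = 0, None
    for d in ordered:
        cur = tuple(d[k] for k in state_keys)
        if cur != prev:
            changes, prev = changes + 1, cur
    return changes

scene = [  # made-up draw list
    {"mesh": "rock", "material": "mossy"},
    {"mesh": "rock", "material": "dry"},
    {"mesh": "tree", "material": "bark"},
    {"mesh": "tree", "material": "bark_wet"},
]
# Main pass cares about the material; the depth-only pass does not.
main_pass_changes = count_state_changes(scene, ["mesh", "material"])  # 4
prepass_changes = count_state_changes(scene, ["mesh"])                # 2
```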
The same setup can be used for the shadowmap pass as well and it may well be the case that the engine already supports it.
If you are using a forward rendering architecture, and unless you can adequately distance-sort all your opaque meshes front to back, a z-prepass is in most cases necessary to reduce overdraw (i.e. to avoid shading each pixel many times in this pass). Forward pass shaders are quite expensive in general, calculating lighting and shadowing for many light types, environment lighting and final shading for a surface, and are very sensitive to overdraw. A z-prepass helps reduce the overdraw especially in cases of dense geometry that can’t be easily distance sorted, like foliage.
A deferred shading architecture reduces the per-pixel overhead by design, separating geometric complexity from lighting and shading, using a dedicated g-prepass to populate a number of rendertargets with material information. In theory, a g-prepass should be relatively cheap, deferring the expensive lighting and shading calculations to a later, screen-space pass. As materials become increasingly complicated and the geometric complexity of game worlds increases, a z-prepass can still be useful with a deferred shading architecture as well, and many games that use that architecture resort to it, among other reasons, to reduce the overdraw during the g-prepass.
Having a depth buffer early in the rendering pipeline, other than improving overdraw, can open the way for many techniques and optimisations, for example:
- Alpha testing with depth equal: useful to speed up rendering of alpha-tested geometry (especially foliage). Perform the alpha testing during the z-prepass and render the meshes in subsequent passes with a z-equal depth test. This allows keeping early-z optimisations where they matter most, with expensive shaders. It will require binding a uv vertex stream to the vertex shader, and a pixel shader, during the z-prepass to read textures and implement the clipping.
- SSAO: It is possible to kick off screen space ambient occlusion calculations early. This will most likely require a full z-prepass to capture the contribution of small meshes as well and, in some algorithms, due to the lack of normals, to reconstruct normals from depth, which can be coarse (it corresponds to the face normal).
- SSR: By raymarching the depth buffer we can determine hitpoints for screen space reflections. Again this will have to rely on depth-reconstructed normals. We could either store the hitpoints to sample the main rendertarget later to resolve reflection colours, or use the previous frame’s rendertarget. This would likely require a full z-prepass as well.
- Screen space shadows: Similarly for screen space shadows we can raymarch the depth buffer towards a light to determine higher resolution shadows in screen space. Same as the above, this would require a full z-prepass to capture shadows from small meshes.
- Deferred shadows: Particularly for a forward rendering pipeline, the depth buffer can be used to reconstruct world positions and calculate shadows in screen space, reducing the complexity of the forward pass shader. For both the above shadowing methods, the usual “skip calculations if facing away from the light (NdotL<0)” optimisation will have to rely on depth reconstructed normals.
- Occlusion culling: Having the depth buffer early in the frame allows for hierarchical-z occlusion culling techniques that cull occluded meshes in subsequent passes (g-prepass or forward pass). Even if hierarchical-z occlusion culling is not implemented, some occlusion can be achieved using hardware occlusion queries, or even predication queries, using the depth buffer. A partial z-prepass with depth from the main occluders is suitable for this.
- Light binning: Having access to scene depths can be used to improve light binning in the case of a tiled or clustered shading lighting architecture.
- Tighter shadow cascade fitting: Similarly, the scene depth range can be used to better fit the cascades to the scene in a cascaded shadow mapping system to increase shadowmap utilisation. A caveat in this case is that the shadowmaps have to be rendered after the z-prepass, but this can be good as we could overlap screenspace work (eg SSAO) with shadowmap rendering.
- Deferred decals: Another technique that can benefit from early access to scene depths is deferred decals, in which case the decals can be projected to the depth buffer and decal materials stored in a decal “g-buffer” to be blended later during the g-prepass or the forward pass.
- Raytracing: The depth buffer can be used to reconstruct the world-space position (ray origin) and face normals for raytracing shadows and reflections, potentially at a lower resolution, decoupled from an expensive forward pass.
- Stencil masks: Not directly related to the z-prepass but we could also use the opportunity to populate the stencil buffer with some values that can be used later, to exclude particular meshes from subsequent rendering passes.
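A toy sketch of the depth-equal alpha-test idea from the list above (a 1D “framebuffer” in plain Python; the fragments and cutoff are made up): the clip happens once in the depth-only pass, and the main pass needs only an EQUAL depth test:

```python
W = 4
ALPHA_CUTOFF = 0.5
depth = [1.0] * W        # depth buffer cleared to the far plane
color = [None] * W       # main-pass output

# Each fragment: (x, z, alpha, colour) — two overlapping "leaves".
frags = [(0, 0.5, 1.0, "leafA"), (1, 0.5, 0.1, "leafA"),
         (1, 0.7, 1.0, "leafB"), (2, 0.7, 1.0, "leafB")]

# Z-prepass: depth-only, but with the alpha clip applied here.
for x, z, a, _ in frags:
    if a >= ALPHA_CUTOFF and z < depth[x]:
        depth[x] = z

# Main pass: EQUAL depth test, no clip/discard needed — the prepass already
# rejected the transparent texel, so early-z stays enabled for the expensive
# shader. (The clipped leafA fragment at x=1 simply fails the EQUAL test.)
for x, z, _, c in frags:
    if depth[x] == z:
        color[x] = c
```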
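The coarse normal reconstruction mentioned for SSAO (and the shadowing items) can be sketched like this (plain Python; the pinhole unprojection with unit focal length is an assumption): the cross product of view-space deltas between neighbouring depth samples gives the face normal:

```python
import math

def view_pos(x, y, z):
    # Toy pinhole unprojection: (x, y) are view-plane coords at unit focal length.
    return (x * z, y * z, z)

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def normal_from_depth(p, p_right, p_down):
    # Face normal from a pixel and its +x / +y neighbours (view space).
    dx = tuple(b - a for a, b in zip(p, p_right))
    dy = tuple(b - a for a, b in zip(p, p_down))
    n = cross(dx, dy)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

# A flat wall at depth 2 facing the camera: the reconstructed normal is the
# view axis (the sign depends on the handedness convention you pick).
n = normal_from_depth(view_pos(0.0, 0.0, 2.0),
                      view_pos(0.1, 0.0, 2.0),
                      view_pos(0.0, 0.1, 2.0))
```

Because the deltas come from discrete depth samples, the result is per-face rather than per-vertex, which is the coarseness the bullet above refers to.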
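The SSR hit search above boils down to marching a ray across the depth buffer; a minimal 1D sketch (made-up depth values, plain Python, fixed-step march with no refinement):

```python
def raymarch(depth_buf, x, z, dx, dz, steps=64):
    # Step across the screen; report the first texel where the ray goes
    # behind the stored depth (a hit), or None if it leaves the screen.
    for _ in range(steps):
        x += dx
        z += dz
        xi = int(x)
        if not 0 <= xi < len(depth_buf):
            return None
        if z >= depth_buf[xi]:
            return xi
    return None

# A depth "step": near wall from x = 5 onwards.
depth = [9.0] * 5 + [2.0] * 5
hit = raymarch(depth, x=0.0, z=1.0, dx=0.5, dz=0.1)    # hits the wall
miss = raymarch(depth, x=0.0, z=1.0, dx=0.5, dz=-0.1)  # flies off-screen
```

The resolved hit coordinate is what would then be used to sample the main (or previous frame’s) rendertarget for the reflection colour.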
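Similarly for the screen space shadows item, a hedged 1D sketch (all values and the fixed bias are made up): starting from a pixel’s stored depth, march toward the light and report shadow if a nearer surface crosses the ray:

```python
def in_shadow(depth_buf, x, step_x, step_z, steps=16, bias=0.01):
    fx, fz = float(x), depth_buf[x]   # start on the surface at this pixel
    for _ in range(steps):
        fx += step_x                  # screen-space direction toward light
        fz += step_z                  # light toward the camera: step_z < 0
        xi = int(fx)
        if not 0 <= xi < len(depth_buf):
            return False              # reached the screen edge unoccluded
        if depth_buf[xi] < fz - bias: # something nearer blocks the light ray
            return True
    return False

# Ground at depth 5 with a tall blocker occupying x = 4..5 at depth 1.
depth = [5.0] * 4 + [1.0] * 2 + [5.0] * 4
shadowed = in_shadow(depth, x=2, step_x=1.0, step_z=-0.5)  # behind blocker
lit = in_shadow(depth, x=8, step_x=1.0, step_z=-0.5)       # clear path
```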
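A minimal hierarchical-z sketch for the occlusion culling item (1D depth buffer, plain Python; a real implementation works on 2D mips and screen-space bounds): build a max-depth pyramid, then reject an object whose nearest depth is behind the coarse texels covering its footprint:

```python
def build_hiz(depth):
    # Max-reduction pyramid: a level-N texel stores the farthest depth it covers.
    mips = [list(depth)]
    while len(mips[-1]) > 1:
        prev = mips[-1]
        mips.append([max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return mips

def occluded(mips, x0, x1, nearest_z):
    # Pick a mip where the footprint [x0, x1] spans only a texel or two.
    level = min(max(x1 - x0, 1).bit_length(), len(mips) - 1)
    lo, hi = x0 >> level, x1 >> level
    farthest = max(mips[level][lo:hi + 1])
    return nearest_z > farthest  # conservatively behind everything stored

# Near wall over x = 0..3 (depth 1), open background on the right (depth 9).
mips = build_hiz([1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0])
```

The max-reduction keeps the test conservative: an object is only culled when it is behind the farthest stored depth in its footprint, so false rejections cannot occur.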
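For the light binning item, a toy sketch of how a tile’s depth range from the z-prepass culls lights (made-up tile depths and lights, plain Python; real binning also tests the tile’s frustum in x/y):

```python
def bin_lights(tile_depths, lights):
    # lights: (name, z_centre, radius). Keep only lights whose sphere
    # overlaps the tile's [min, max] depth range.
    zmin, zmax = min(tile_depths), max(tile_depths)
    return [name for name, z, r in lights
            if z + r >= zmin and z - r <= zmax]

tile = [4.0, 4.5, 5.0]              # depths of this tile's pixels
lights = [("near", 1.0, 1.0),       # floats in front of every surface
          ("mid", 4.8, 0.5),        # touches visible geometry
          ("far", 9.0, 1.0)]        # buried behind everything
kept = bin_lights(tile, lights)
```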
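For the cascade fitting item, a small sketch (plain Python) of the common practical split scheme, a blend of uniform and logarithmic splits, applied to the depth range read back from the z-prepass instead of the camera’s full near/far range; the blend factor is an assumption:

```python
def fit_cascade_splits(scene_min_z, scene_max_z, n_cascades=4, lam=0.75):
    # Blend of logarithmic and uniform splits over the *observed* depth
    # range, rather than the camera's full near/far range.
    splits = []
    for i in range(1, n_cascades + 1):
        f = i / n_cascades
        log_s = scene_min_z * (scene_max_z / scene_min_z) ** f
        uni_s = scene_min_z + (scene_max_z - scene_min_z) * f
        splits.append(lam * log_s + (1 - lam) * uni_s)
    return splits

# Fitting to the visible range [1, 16] instead of a camera range like
# [0.1, 1000] stops shadowmap texels being spent on empty depth.
tight = fit_cascade_splits(1.0, 16.0, n_cascades=2, lam=1.0)  # [4.0, 16.0]
```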
The list is not exhaustive. Some of those techniques could benefit a forward rendering pipeline more than a deferred one, as there are not many opportunities to have access to the scene depth in screenspace (or any other data for screen space techniques for that matter) before the forward pass. Even in the case of a deferred shading pipeline though, this can offer opportunities to reduce the overdraw in the g-prepass and to allow for better scheduling and overlapping of work, for example the SSAO pass could potentially be overlapped with the g-prepass.
As was pointed out in the Twitter thread I mentioned in the opening paragraph though, the z-prepass technique comes with a large YMMV banner. The actual gain it can offer depends a lot on the content and the rendering pipeline, and profiling will be needed to determine it. As discussed above, a forward rendering pipeline will almost always benefit from it, but a deferred shading one can as well, especially in scenes with large geometric complexity, such as those with a lot of foliage or alpha-tested geometry, and especially if no good occlusion culling approach can be used. This will have to be weighed against the increased CPU overhead of submitting drawcalls twice, and against any opportunities to improve batching in the engine. In this context a decision on partial vs full z-prepass must be made.
Enabling some occlusion culling early in the frame is definitely useful and can reduce the CPU load (especially if GPU-driven submission is used) and the vertex load for subsequent passes (which can potentially offset the overhead of the z-prepass). Opportunities to overlap screenspace work (for example SSAO) with geometry rendering in the case of a deferred shading architecture will depend on the bottlenecks of the g-prepass, as both passes are usually bandwidth bound. Reordering the rendering pipeline to perform the z-prepass first and then the shadowmap rendering could allow for some improved overlapping between the compute and graphics pipelines (eg calculate SSAO or screen space shadows while rendering the shadowmaps). Decoupling the forward pass resolution from screen space effects like SSR, SSAO etc is good in terms of performance and allows for calculating them at a lower resolution.
There is another caveat to consider: binding the depth buffer produced by the z-prepass as a texture for the screen space passes (SSAO, SSR etc) will decompress it on some GPU architectures, making z-testing less efficient later, during the forward or g-buffer passes. In this case, it may be worth considering binding a pixel shader during the z-prepass to output the depth to a separate rendertarget and using that as the input.