Experimenting with fp16 in shaders

With recent GPUs and shader models there is good support for 16-bit floating point numbers and operations in shaders. On paper, the fp16 representation has two main advantages. It allows packing two 16-bit numbers into a single 32-bit register, which reduces a shader's register allocation and can increase occupancy. It also reduces the ALU instruction count, because a single instruction can operate on a packed 32-bit register directly (i.e. affecting the two packed fp16 numbers independently). I spent some time investigating what fp16 looks like at the ISA level (GCN 5) and am sharing some notes I took.
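To make the packing idea concrete, here is a minimal CPU-side sketch in Python of two fp16 values sharing one 32-bit word, with a "packed add" that treats each half independently. It only emulates the concept and is not what the shader compiler emits; the helper names `pack_half2`, `unpack_half2` and `packed_add` are hypothetical and chosen for illustration.

```python
import struct

def pack_half2(lo, hi):
    """Round two floats to fp16 and pack them into one 32-bit word.

    `lo` ends up in bits 0-15 and `hi` in bits 16-31.
    """
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_half2(word):
    """Split a 32-bit word back into its two fp16 values (lo, hi)."""
    return struct.unpack("<2e", struct.pack("<I", word))

def packed_add(a, b):
    """Emulate a packed fp16 add: one operation, two independent results."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    # Each half is added on its own and rounded back to fp16 when repacked.
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

x = pack_half2(1.0, 2.0)
y = pack_half2(0.5, 0.25)
print(hex(x))                          # both halves live in a single 32-bit value
print(unpack_half2(packed_add(x, y)))  # each half was added independently
```

The point of the sketch is the bookkeeping: two values occupy the space of one register, and one operation updates both.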

I started with a very simple compute shader implementing some fp16 maths as a test. I compiled it with DXC, targeting Shader Model 6.2 and passing the -enable-16bit-types command line argument.


Stream compaction using wave intrinsics

It is common knowledge that removing unnecessary work is a crucial mechanism for achieving good performance on the GPU. We routinely create lists of visible model instances, for example, using frustum and other means of culling to avoid rendering geometry that will not contribute to the final image. While it is easy to create such lists on the CPU, it may not be as trivial for work generated on the GPU, for example when using GPU driven culling/rendering, or deciding which pixels in the image to raytrace reflections for. Such operations typically produce lists that still contain invalid (culled) work items. Processing those lists is not a very effective way to make use of a GPU’s batch processing nature, because the shader either has to skip over code or leave threads idle (inactive) in a wave.
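As a rough illustration of what compacting such a list involves, the following Python sketch emulates one common wave-level approach on the CPU: each lane evaluates a keep/cull predicate, a prefix count over the wave gives every surviving lane its slot, and the wave reserves its output range once. This is an assumption about the general technique, not necessarily the method the full post uses; the function name `compact_stream` and the wave size of 64 are hypothetical. The comments name the HLSL intrinsics that play the corresponding role on the GPU.

```python
WAVE_SIZE = 64  # assumption: 64 lanes per wave

def compact_stream(items, keep):
    """Return the items for which keep(item) is true, packed contiguously."""
    output = []
    for base in range(0, len(items), WAVE_SIZE):
        wave = items[base:base + WAVE_SIZE]
        flags = [bool(keep(item)) for item in wave]

        # Per-lane count of surviving lanes with a lower index
        # (the role of WavePrefixCountBits in HLSL).
        prefix = []
        survivors = 0
        for flag in flags:
            prefix.append(survivors)
            if flag:
                survivors += 1

        # One reservation per wave instead of one per lane
        # (on the GPU: WaveActiveCountBits plus a single atomic add).
        wave_offset = len(output)
        output.extend([None] * survivors)

        # Surviving lanes write to their compacted slot.
        for lane, item in enumerate(wave):
            if flags[lane]:
                output[wave_offset + prefix[lane]] = item
    return output

visible = compact_stream(list(range(200)), lambda i: i % 3 == 0)
print(len(visible))
```

Note that this loop processes waves in order, so the output preserves input order; on a real GPU the order in which waves reserve their ranges is not guaranteed.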
