Experimenting with fp16 in shaders

With recent GPUs and shader models there is good support for 16 bit floating point numbers and operations in shaders. On paper, the main advantages of the fp16 representation are that it allows packing two 16 bit numbers into a single 32 bit register, reducing the register allocation of a shader and increasing occupancy, and that it allows a reduction in ALU instruction count by applying instructions to packed 32 bit registers directly (i.e. operating on both packed fp16 numbers with a single instruction). I spent some time investigating what fp16 looks like at the ISA level (GCN 5) and am sharing some notes I took.

I started with a very simple compute shader implementing some fp16 maths as a test. I compiled it using shader model 6.2 and the -enable-16bit-types DXC command line argument.

cbuffer data
{
    float value1;
    float value2; 
    float value3;   
    float  pad;
}

// Output textures (declarations assumed, they were omitted from the original listing;
// RWTexture2D<half4> matches the d16 image_store with dmask:0xf in the ISA below)
RWTexture2D<half4> output1;
RWTexture2D<half4> output2;

[numthreads(8, 8, 1)]
void main(int2 thread_id : SV_DispatchThreadID) 
{
    half x = half(thread_id.x) * half(value1);
    half y = half(thread_id.y) * half(value2);   
                                          
    half result1 = x * half(value2);
    half result2 = y * half(value1);      
                                                           
    output1[thread_id] = result1;
    output2[thread_id] = result2;
} 
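
For reference, the DXC invocation looks roughly like this (a sketch; the file names are placeholders):

dxc -T cs_6_2 -E main -enable-16bit-types test_fp16.hlsl -Fo test_fp16.bin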

The produced ISA looks like this

  s_getpc_b64   s[0:1]                                  // 000000000000: BE801C80
  s_mov_b32     s3, s1                                  // 000000000004: BE830001
  s_load_dwordx4  s[12:15], s[2:3], 0x00                // 000000000008: C00A0301 00000000
  s_waitcnt     lgkmcnt(0)                              // 000000000010: BF8CC07F
  s_buffer_load_dwordx2  s[2:3], s[12:15], 0x00         // 000000000014: C0260086 00000000
  s_mov_b32     s0, s4                                  // 00000000001C: BE800004
  s_load_dwordx8  s[12:19], s[0:1], 0x20                // 000000000020: C00E0300 00000020
  v_mad_u32_u24  v0, s7, 8, v0                          // 000000000028: D1C30000 04011007
  s_load_dwordx8  s[20:27], s[0:1], 0x40                // 000000000030: C00E0500 00000040
  v_mad_u32_u24  v1, s8, 8, v1                          // 000000000038: D1C30001 04051008
  v_cvt_f32_i32  v2, v0                                 // 000000000040: 7E040B00
  s_waitcnt     lgkmcnt(0)                              // 000000000044: BF8CC07F
  v_cvt_pkrtz_f16_f32  v2, s2, v2                       // 000000000048: D2960002 00020402
  v_cvt_f32_i32  v3, v1                                 // 000000000050: 7E060B01
  v_mul_f16     v4, v2, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_0 // 000000000054: 440804F9 04051402
  v_cvt_pkrtz_f16_f32  v3, s3, v3                       // 00000000005C: D2960003 00020603
  v_mul_f16     v4, v3, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_0 // 000000000064: 440806F9 04051503
  v_pk_mul_f16  v3, v4, v3 op_sel_hi:[0,0]              // 00000000006C: D3900003 00020704
  v_pk_mul_f16  v2, v2, v4 op_sel:[0,1] op_sel_hi:[0,1] // 000000000074: D3901002 10020902
  v_mov_b32     v4, v3                                  // 00000000007C: 7E080303
  image_store   v[3:6], v[0:3], s[12:19] dmask:0xf unorm glc d16 // 000000000080: F0203F00 80030300
  v_mov_b32     v3, v2                                  // 000000000088: 7E060302
  image_store   v[2:5], v[0:3], s[20:27] dmask:0xf unorm glc d16 // 00000000008C: F0203F00 80050200

On line 5 (the s_buffer_load_dwordx2) we read value1 and value2 from the constant buffer. Next follows some code to convert the thread index to float, and the interesting work begins on line 13, where the float values (thread_id.x and value1) are packed into a single register (v2) using the v_cvt_pkrtz_f16_f32 instruction, which can take both scalar and vector registers as inputs. The first input (source) of the instruction (s2 in this case, containing value1 from the constant buffer) is packed into the low 16 bits (0-15, later referred to as WORD0) and the second input (v2, containing thread_id.x) into the high 16 bits (16-31, later referred to as WORD1), i.e. v2 now holds value1 in WORD0 and the converted thread_id.x in WORD1.

The mul instruction on line 15 (the first v_mul_f16) is the fp16 version, which multiplies 2 fp16 numbers packed in full 32 bit registers. Let’s see how this works in greater detail.

v_mul_f16 v4, v2, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_0

This instruction says: multiply 2 fp16 numbers, both selected from register v2, and write the output to the lower 16 bits of register v4, leaving the other 16 bits untouched (dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE). From input 0 (src0) use the fp16 packed in the high 16 bits (src0_sel:WORD_1), while from the second input (src1) use the fp16 in the low 16 bits (src1_sel:WORD_0). This is effectively used to multiply 2 fp16 numbers stored in the same 32 bit register. Lines 16 and 17 do the same for the value2 and thread_id.y multiplication.
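
In pseudocode (just a sketch of the selection semantics, not compiler output), the instruction above does the following:

// v_mul_f16 v4, v2, v2  dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_0
v4.word0 = v2.word1 * v2.word0;   // half(thread_id.x) * half(value1), i.e. x
// v4.word1 is left untouched (UNUSED_PRESERVE)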

So far we don’t seem to have saved much with fp16 over fp32: we used a single instruction to multiply a pair of fp16 numbers (plus the extra instructions for the conversions). This changes on lines 18 and 19 (the two v_pk_mul_f16 instructions), where 2 pairs of fp16s are multiplied with a single instruction.

v_pk_mul_f16 v3, v4, v3 op_sel_hi:[0,0]

By default this instruction will multiply the low parts of each 32 bit source register and store the fp16 result in the low part of the destination register, and similarly for the high parts (i.e. the modifiers are implicitly op_sel:[0,0] op_sel_hi:[1,1] and omitted). This can be overridden with the op_sel and op_sel_hi modifiers: op_sel determines which part (low – 0, or high – 1) of each input is used for the low part of the destination register, and op_sel_hi does the same for the high part of the destination. The particular instruction above uses the default behaviour for the low bits of the destination (i.e. it stores the product of the low fp16 numbers in v4 and v3, op_sel:[0,0]), and then performs the same multiplication for the high 16 bits of the destination (i.e. both parts contain the result of value2 * thread_id.x * value1).
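
In the same pseudocode style (again just a sketch), with 0 selecting the low word of a source and 1 the high word:

// v_pk_mul_f16 dst, src0, src1  op_sel:[a,b] op_sel_hi:[c,d]
dst.word0 = src0.word(a) * src1.word(b);
dst.word1 = src0.word(c) * src1.word(d);

// for v_pk_mul_f16 v3, v4, v3 op_sel_hi:[0,0] above (op_sel defaults to [0,0]):
v3.word0 = v4.word0 * v3.word0;   // x * half(value2), i.e. result1
v3.word1 = v4.word0 * v3.word0;   // result1 again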

The v_pk_mul_f16 on line 19 does a similar operation for register v2, overriding the default selection to pick the low part of v2 and the high part of v4 and store the product to both parts of v2 (i.e. both parts contain the result of value1 * thread_id.y * value2).

v_pk_mul_f16  v2, v2, v4 op_sel:[0,1] op_sel_hi:[0,1]

Finally the results are stored to the output textures using 2 image_store instructions. It is worth noting that the instructions use the d16 modifier, as the compiler is aware that the data being written is 16 bit.

It is not clear yet what we’ve gained by using fp16, so let’s compile the same shader without the -enable-16bit-types argument, which will promote all fp16 data types and operations to fp32

  s_getpc_b64   s[0:1]                                  // 000000000000: BE801C80
  s_mov_b32     s3, s1                                  // 000000000004: BE830001
  s_load_dwordx4  s[12:15], s[2:3], 0x00                // 000000000008: C00A0301 00000000
  s_waitcnt     lgkmcnt(0)                              // 000000000010: BF8CC07F
  s_buffer_load_dwordx2  s[2:3], s[12:15], 0x00         // 000000000014: C0260086 00000000
  s_mov_b32     s0, s4                                  // 00000000001C: BE800004
  s_load_dwordx8  s[12:19], s[0:1], 0x20                // 000000000020: C00E0300 00000020
  s_load_dwordx8  s[20:27], s[0:1], 0x40                // 000000000028: C00E0500 00000040
  v_mad_u32_u24  v0, s7, 8, v0                          // 000000000030: D1C30000 04011007
  v_mad_u32_u24  v1, s8, 8, v1                          // 000000000038: D1C30001 04051008
  v_cvt_f32_i32  v2, v0                                 // 000000000040: 7E040B00
  s_waitcnt     lgkmcnt(0)                              // 000000000044: BF8CC07F
  v_mul_f32     v2, s2, v2                              // 000000000048: 0A040402
  v_cvt_f32_i32  v3, v1                                 // 00000000004C: 7E060B01
  v_mul_f32     v3, s3, v3                              // 000000000050: 0A060603
  v_mul_f32     v2, s3, v2                              // 000000000054: 0A040403
  v_mul_f32     v6, s2, v3                              // 000000000058: 0A0C0602
  v_mov_b32     v3, v2                                  // 00000000005C: 7E060302
  v_mov_b32     v4, v2                                  // 000000000060: 7E080302
  v_mov_b32     v5, v2                                  // 000000000064: 7E0A0302
  image_store   v[2:5], v[0:3], s[12:19] dmask:0xf unorm glc // 000000000068: F0203F00 00030200
  v_mov_b32     v7, v6                                  // 000000000070: 7E0E0306
  v_mov_b32     v8, v6                                  // 000000000074: 7E100306
  v_mov_b32     v9, v6                                  // 000000000078: 7E120306
  image_store   v[6:9], v[0:3], s[20:27] dmask:0xf unorm glc // 00000000007C: F0203F00 00050600

The compiled shader is a bit longer for one, due to the extra mov instructions. It issues 4 multiplication instructions, the same as the fp16 version, so no gain there. In terms of register allocation, the fp32 version uses 10 VGPRs and 30 SGPRs while the fp16 version uses 5 VGPRs and 30 SGPRs, so there is potential to improve occupancy here.

One other thing worth investigating is what would happen if the constant buffer data were fp16 to begin with

cbuffer data
{
    half value1;
    half  value2; 
    float value3;   
    float2  pad;
}        

The resulting ISA will then be

  s_getpc_b64   s[0:1]                                  // 000000000000: BE801C80
  s_mov_b32     s5, s1                                  // 000000000004: BE850001
  s_load_dwordx8  s[12:19], s[4:5], 0x20                // 000000000008: C00E0302 00000020
  s_load_dwordx8  s[20:27], s[4:5], 0x40                // 000000000010: C00E0502 00000040
  s_mov_b32     s0, s2                                  // 000000000018: BE800002
  s_load_dwordx4  s[0:3], s[0:1], 0x00                  // 00000000001C: C00A0000 00000000
  s_waitcnt     lgkmcnt(0)                              // 000000000024: BF8CC07F
  s_buffer_load_dword  s0, s[0:3], 0x00                 // 000000000028: C0220000 00000000
  v_mad_u32_u24  v0, s7, 8, v0                          // 000000000030: D1C30000 04011007
  v_mad_u32_u24  v1, s8, 8, v1                          // 000000000038: D1C30001 04051008
  v_cvt_f32_i32  v2, v0                                 // 000000000040: 7E040B00
  v_cvt_pkrtz_f16_f32  v2, 0, v2                        // 000000000044: D2960002 00020480
  s_waitcnt     lgkmcnt(0)                              // 00000000004C: BF8CC07F
  v_mul_f16     v2, s0, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_1 // 000000000050: 440404F9 05841400
  v_cvt_f32_i32  v3, v1                                 // 000000000058: 7E060B01
  v_cvt_pkrtz_f16_f32  v3, 0, v3                        // 00000000005C: D2960003 00020680
  v_mul_f16     v2, s0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1 // 000000000064: 440406F9 05851500
  v_pk_mul_f16  v3, v2, s0 op_sel:[0,1] op_sel_hi:[0,1] // 00000000006C: D3901003 10000102
  v_mov_b32     v4, v3                                  // 000000000074: 7E080303
  image_store   v[3:6], v[0:3], s[12:19] dmask:0xf unorm glc d16 // 000000000078: F0203F00 80030300
  v_pk_mul_f16  v2, s0, v2 op_sel:[0,1] op_sel_hi:[0,1] // 000000000080: D3901002 10020400
  v_mov_b32     v3, v2                                  // 000000000088: 7E060302
  image_store   v[2:5], v[0:3], s[20:27] dmask:0xf unorm glc d16 // 00000000008C: F0203F00 80050200

One interesting thing is that on line 8 (the s_buffer_load_dword), the constant buffer load instruction loads both value1 and value2 already packed into s0, and the operand selection logic of the v_mul_f16/v_pk_mul_f16 instructions changes a bit to reflect that, but otherwise the code is pretty similar.
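
For reference, the constant buffer layout with -enable-16bit-types presumably looks like this (an assumption, based on that single dword load at offset 0x00):

// offset 0: value1 (fp16) | offset 2: value2 (fp16) | offset 4: value3 (fp32) | offset 8: pad (float2)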

What will happen if we throw a float operation into the mix?

cbuffer data
{
    half value1;
    half value2; 
    float value3;   
    float2  pad;
}                 

[numthreads(8, 8, 1)]
void main(int2 thread_id : SV_DispatchThreadID) 
{
    half x = half(thread_id.x) * half(value1);
    half y = half(thread_id.y) * half(value2);    
    
    x *= value3; // multiply the fp16 x by the fp32 value3
                                          
    half result1 = x * half(value2);
    half result2 = y * half(value1);      
                                                           
    output1[thread_id] = result1;
    output2[thread_id] = result2;
} 

The resulting ISA now looks like

  s_getpc_b64   s[0:1]                                  // 000000000000: BE801C80
  s_mov_b32     s5, s1                                  // 000000000004: BE850001
  s_load_dwordx8  s[12:19], s[4:5], 0x20                // 000000000008: C00E0302 00000020
  s_load_dwordx8  s[20:27], s[4:5], 0x40                // 000000000010: C00E0502 00000040
  s_mov_b32     s0, s2                                  // 000000000018: BE800002
  s_load_dwordx4  s[0:3], s[0:1], 0x00                  // 00000000001C: C00A0000 00000000
  s_waitcnt     lgkmcnt(0)                              // 000000000024: BF8CC07F
  s_buffer_load_dwordx2  s[0:1], s[0:3], 0x00           // 000000000028: C0260000 00000000
  v_mad_u32_u24  v0, s7, 8, v0                          // 000000000030: D1C30000 04011007
  v_mad_u32_u24  v1, s8, 8, v1                          // 000000000038: D1C30001 04051008
  v_cvt_f32_i32  v2, v0                                 // 000000000040: 7E040B00
  v_cvt_pkrtz_f16_f32  v2, 0, v2                        // 000000000044: D2960002 00020480
  s_waitcnt     lgkmcnt(0)                              // 00000000004C: BF8CC07F
  v_mul_f16     v2, s0, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_1 // 000000000050: 440404F9 05841400
  v_cvt_f32_i32  v3, v1                                 // 000000000058: 7E060B01
  v_cvt_pkrtz_f16_f32  v3, 0, v3                        // 00000000005C: D2960003 00020680
  v_mul_f16     v2, s0, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1 // 000000000064: 440406F9 05851500
  v_cvt_f32_f16  v3, v2                                 // 00000000006C: 7E061702
  v_mul_f32     v3, s1, v3                              // 000000000070: 0A060601
  v_cvt_pkrtz_f16_f32  v3, 0, v3                        // 000000000074: D2960003 00020680
  v_pk_mul_f16  v3, s0, v3 op_sel:[1,1] op_sel_hi:[1,1] // 00000000007C: D3901803 18020600
  v_mov_b32     v4, v3                                  // 000000000084: 7E080303
  image_store   v[3:6], v[0:3], s[12:19] dmask:0xf unorm glc d16 // 000000000088: F0203F00 80030300
  v_pk_mul_f16  v2, s0, v2 op_sel:[0,1] op_sel_hi:[0,1] // 000000000090: D3901002 10020400
  v_mov_b32     v3, v2                                  // 000000000098: 7E060302
  image_store   v[2:5], v[0:3], s[20:27] dmask:0xf unorm glc d16 // 00000000009C: F0203F00 80050200

Things go well until line 18 (the v_cvt_f32_f16 below), where the compiler has to promote the fp16 result to fp32 to perform the float multiplication

v_cvt_f32_f16  v3, v2  

and then convert back to fp16 on line 20 (the v_cvt_pkrtz_f16_f32). In this simple shader, mixing in a float operation just increases the instruction count, but in more complex shaders it could well increase the register allocation too, so it is best avoided.
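
If the extra precision of value3 is not actually needed, the round trip can be avoided by converting it to fp16 before the multiplication (a sketch, assuming fp16 precision is acceptable here):

x *= half(value3); // keeps the expression in fp16, no conversion of x to fp32 and back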

This was useful to show how fp16 works under the hood and it showed some potential, but this simple shader isn’t enough to reveal any real benefits. For that I will try an actual shader from the FidelityFX SSSR sample, which uses fp16 extensively. I realise that I keep going back to the SSSR sample for my investigations since I integrated it into my toy renderer. I quite like that sample; it implements a lot of modern techniques and saves me the trouble of implementing them for quick investigations.

So, compiling the ResolveTemporal.hlsl shader with the same DXC configuration as above, I get a VGPR allocation of 110 and an SGPR allocation of 37. Compiling the shader without support for fp16 I get a VGPR allocation of 83 and an SGPR allocation of 49, so the fp32 version has the advantage here (SGPR allocation rarely becomes the bottleneck).

I won’t list the shader ISA here as it is 1200+ lines, but on first inspection it seems to contain a healthy dose of v_pk instructions like the following

  v_pk_add_f16  v18, v18, v24 op_sel_hi:[1,1]           // 000000000D7C: D38F0012 18023112
  v_pk_mul_f16  v24, v26, s12 op_sel_hi:[1,0]           // 000000000D84: D3900018 0800191A
  v_pk_mul_f16  v26, v31, v31 op_sel:[1,1] op_sel_hi:[0,0] // 000000000D8C: D390181A 00023F1F
  v_pk_add_f16  v16, v16, v17 op_sel_hi:[1,1]           // 000000000D94: D38F0010 18022310
  v_pk_mul_f16  v17, v19, s3 op_sel_hi:[1,0]   

but it also contains a few promotions from fp16 to fp32 which, as discussed, are not ideal.

  v_cvt_f32_f16  v5, v2 src0_sel: WORD_1                // 000000001F00: 7E0A16F9 00050602
  v_cvt_f32_f16  v11, v44                               // 000000001F08: 7E16172C
  v_cvt_f32_f16  v2, v2                                 // 000000001F0C: 7E041702
  v_cvt_f32_f16  v12, v8                                // 000000001F10: 7E181708
  v_cvt_f32_f16  v13, v22                               // 000000001F14: 7E1A1716
  v_cvt_f32_f16  v16, v20                               // 000000001F18: 7E201714

There must be some float operations mixed in somewhere; a quick investigation reveals that FFX_DNSR_Reflections_ClipAABB() mixes fp32 and fp16 operations:

half3 FFX_DNSR_Reflections_ClipAABB(half3 aabb_min, half3 aabb_max, half3 prev_sample) {
    // Main idea behind clipping - it prevents clustering when neighbor color space
    // is distant from history sample

    // Here we find intersection between color vector and aabb color box

    // Note: only clips towards aabb center
    float3 aabb_center = 0.5 * (aabb_max + aabb_min);
    float3 extent_clip = 0.5 * (aabb_max - aabb_min) + 0.001;

    // Find color vector
    float3 color_vector = prev_sample - aabb_center;
    // Transform into clip space
    float3 color_vector_clip = color_vector / extent_clip;
    // Find max absolute component
    color_vector_clip       = abs(color_vector_clip);
    half max_abs_unit = max(max(color_vector_clip.x, color_vector_clip.y), color_vector_clip.z);

    if (max_abs_unit > 1.0) {
        return aabb_center + color_vector / max_abs_unit; // clip towards color vector
    } else {
        return prev_sample; // point is inside aabb
    }
}
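
One way to keep the function entirely in fp16 (a sketch, assuming half precision is acceptable for the clipping maths) is to make the intermediates half as well, e.g.

half3 aabb_center = half(0.5) * (aabb_max + aabb_min);
half3 extent_clip = half(0.5) * (aabb_max - aabb_min) + half(0.001);
half3 color_vector = prev_sample - aabb_center;
half3 color_vector_clip = abs(color_vector / extent_clip);
half  max_abs_unit = max(max(color_vector_clip.x, color_vector_clip.y), color_vector_clip.z);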

There is another one here:

half FFX_DNSR_Reflections_Luminance(half3 color) { return max(dot(color, float3(0.299, 0.587, 0.114)), 0.001); }

And a fun one here (in FFX_DNSR_Reflections_ResolveTemporal())

 // Blend with average for small sample count
 new_signal.xyz                                  = lerp(new_signal.xyz, avg_radiance, 1.0 / max(num_samples + 1.0f, 1.0));

If you can’t see the issue in the last one at first: the “f” suffix after the literal 1.0 forces promotion of all operands to float.
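
A version that stays in fp16 (a sketch, assuming num_samples and avg_radiance are already fp16 in the surrounding code) drops the suffix or, more explicitly, uses half casts:

new_signal.xyz = lerp(new_signal.xyz, avg_radiance, half(1.0) / max(num_samples + half(1.0), half(1.0)));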

Mixed fp16/fp32 operation issues may be quite tricky to locate, especially when you retrofit fp16 support onto a large codebase. Quick tip: always inspect the DXC warnings, they may tell you where floats are demoted to fp16:

ffx_denoiser_reflections_common.h:53:59: warning: conversion from larger type 'float' to smaller type 'half', possible loss of data [-Wconversion]
half FFX_DNSR_Reflections_Luminance(half3 color) { return max(dot(color, float3(0.299, 0.587, 0.114)), 0.001); }
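
For that particular warning the fix (a sketch, assuming the luminance weights don’t need fp32 precision) is to keep the literals in half:

half FFX_DNSR_Reflections_Luminance(half3 color) { return max(dot(color, half3(0.299, 0.587, 0.114)), half(0.001)); }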

Fixing those issues drops VGPR allocation only by one to 109 and the SGPR allocation by 3 to 34.

There are also some promotions to float that I couldn’t fix

  v_cvt_f32_f16  v1, v0                                 // 000000002120: 7E021700
  v_cvt_f32_f16  v5, v3 src0_sel: WORD_1                // 000000002124: 7E0A16F9 00050603
  s_movk_i32    s0, 0x0204                              // 00000000212C: B0000204
  v_mul_f16     v4, v4, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1 // 000000002130: 440804F9 05051404
  v_cvt_f32_f16  v6, v3                                 // 000000002138: 7E0C1703
  v_cmp_class_f32  s[2:3], v1, s0                       // 00000000213C: D0100002 00000101
  v_cmp_class_f32  s[6:7], v5, s0                       // 000000002144: D0100006 00000105
  v_cmp_class_f32  s[8:9], v1, 3                        // 00000000214C: D0100008 00010701
  v_cmp_class_f32  s[12:13], v5, 3                      // 000000002154: D010000C 00010705

The v_cmp_class_f32 instructions refer to this part of FFX_DNSR_Reflections_ResolveTemporal:

        if (any(isinf(new_signal)) || any(isnan(new_signal)) || any(isinf(new_variance)) || any(isnan(new_variance))) 
        {
            new_signal   = 0.0;
            new_variance = 0.0;
        }

Although GCN 5 supports a 16 bit version of this comparison instruction (v_cmp_class_f16), the compiler doesn’t use it for some reason (all the inputs are fp16).

Would VGPR allocation change if we targeted RDNA instead of GCN? Without fp16 support the compiler allocates 87 VGPRs and 48 SGPRs while with fp16 support it allocates 106 VGPRs and 44 SGPRs, so the fp32 version still has the advantage.

In the end, all that matters is the actual performance impact. On my integrated AMD GPU (GCN 5) the temporal resolve SSSR pass costs the same with both fp16 and fp32, while on an NVidia RTX 3080 mobile GPU the fp32 version is actually 23% faster, so for this particular shader fp16 is not really a gain.

It appears that, as with many techniques and features, the actual gain will vary depending on the application and, as always, profiling is needed to determine whether it will improve a rendering pass.
