The accuracy flag (low | fast | safe) selects a tradeoff
between performance and acceptable error, measured in ULPs (units in the last place).
The product of two bfloat16 operands is in single-precision
floating-point (fp32) format. The bfloat16 vector
multiplier is immediately followed by an fp32 adder in the vector
processor execution pipeline to implement multiply-accumulate or multiply-add
operations.
fp32 values have an 8-bit exponent and a 23-bit mantissa
with an implicit leading 1 for normal numbers:
- Maximum positive value: $(2 - 2^{-23}) \times 2^{127} \approx 3.403 \times 10^{38}$
- Minimum positive value: $2^{-126} \approx 1.175 \times 10^{-38}$
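The two limits above can be checked directly from the fp32 bit layout. The following is a minimal sketch; the helper name `bits_to_f32` and the constant names are illustrative only and are not part of any documented API.

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an fp32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Largest finite fp32: sign 0, exponent field 0xFE, mantissa all ones.
F32_MAX = bits_to_f32(0x7F7FFFFF)
# Smallest positive normal fp32: sign 0, exponent field 0x01, mantissa zero.
F32_MIN_NORMAL = bits_to_f32(0x00800000)

# Compare against the closed-form expressions from the list above.
print(F32_MAX == (2 - 2**-23) * 2.0**127)
print(F32_MIN_NORMAL == 2.0**-126)
```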
bfloat16 numbers retain the 8-bit exponent, but the mantissa
is reduced to 7 bits with an implicit leading 1. An fp32 value is
emulated using the sum of 3 bfloat16 values. This
representation may not be exact for some extremely small values, because the exponent of
some of the bfloat16 values might be too small to be
representable as a normal value.

With X as an fp32 value and Z_0, Z_1, and Z_2 as bfloat16 values:
- Z_0 = (bfloat16) X  # cast X to a bfloat16 value
- Z_1 = (bfloat16) (X - (float) Z_0)  # cast Z_0 to an fp32 value
- Z_2 = (bfloat16) (X - (float) Z_0 - (float) Z_1)
- X ≈ (float) Z_0 + (float) Z_1 + (float) Z_2
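The three-term split can be emulated in software to see how it behaves. The sketch below is not the hardware implementation: the helper names (`to_f32`, `to_bf16`, `split3`) are hypothetical, and the fp32-to-bfloat16 cast is assumed to round to nearest even, which the text above does not specify. NaN and infinity are not handled.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding it to fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """fp32 value (as a Python float) for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest fp32 value."""
    return bits_f32(f32_bits(x))

def to_bf16(x: float) -> float:
    """Round an fp32 value to bfloat16 and return it as a float.

    Assumption: round-to-nearest-even on the 16 discarded mantissa bits.
    """
    b = f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)   # rounding bias, ties go to even
    return bits_f32(b & 0xFFFF0000)  # keep sign, exponent, 7 mantissa bits

def split3(x: float):
    """Split an fp32 value into three bfloat16 terms Z_0, Z_1, Z_2."""
    x = to_f32(x)
    z0 = to_bf16(x)                    # leading bits of X
    z1 = to_bf16(to_f32(x - z0))       # bits of the first residual
    z2 = to_bf16(to_f32(x - z0 - z1))  # bits of the second residual
    return z0, z1, z2

x = to_f32(3.14159265)
z0, z1, z2 = split3(x)
print(z0, z1, z2)
print("exact:", z0 + z1 + z2 == x)
```

For inputs close to the bottom of the fp32 range, the residual terms can underflow and the sum no longer reproduces X, which is the inexact case described above.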
Based on the selected accuracy of the floating-point operation, the following
table shows the ULP error when emulating fp32 multiplication and
the corresponding code that is executed.
These ULPs are given for X and Y values such that X, Y, and X*Y have
an exponent in the range [-102, +126]. This is equivalent to an fp32 magnitude in
the range [1.97e-31 ; 1.70e+38].
| Accuracy Flag | ULP Range or ULP Frequency | Executed Assembly Code |
|---|---|---|
| low | 6 to 11 | |
| fast | 0: 56.11%, 1: 37.68%, 2: 5.36%, 3: 0.83%, 4: 0.02% | |
| safe | 0: 99.11%, 1: 0.89%, 2: 5.8e-4% | |
Higher accuracy requires more operations, resulting in reduced throughput and longer latency.
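
As an illustration of how a ULP error like those in the table can be measured, the sketch below counts the number of representable fp32 values between a reference result and an emulated result. The function names `f32_ordinal` and `ulp_distance` are hypothetical, and this is one common way to define the distance rather than the method used to produce the table.

```python
import struct

def f32_ordinal(x: float) -> int:
    """Map an fp32 value to an integer so that adjacent fp32 values differ by 1."""
    b = struct.unpack("<I", struct.pack("<f", x))[0]
    # Non-negative values keep their bit pattern; negative values are mirrored.
    return b if b < 0x80000000 else 0x80000000 - b

def ulp_distance(a: float, b: float) -> int:
    """Number of fp32 steps between a and b (both are rounded to fp32 first)."""
    return abs(f32_ordinal(a) - f32_ordinal(b))

reference = 1.0
emulated = 1.0 + 3 * 2**-23  # three fp32 steps above 1.0
print(ulp_distance(reference, emulated))

# Bounds of the magnitude range quoted for the table.
print(2.0**-102, 2.0**127)
```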