Depending on the chosen accuracy mode for floating-point operations
(low | fast | safe), the precision in terms of
units in the last place (ULP, the position of the last correct bit) differs. The precision dictates
the number of operations the processor must execute. Floating-point
addition relies on the floating-point adder located in the hardware right after
the bfloat16 vector multiplier.
Single-precision floating-point (fp32) values have an 8-bit exponent and a 23-bit mantissa with an
implicit leading 1 for normal numbers:
- Maximum positive value: (2 - 2^-23) x 2^127 ~= 3.403 x 10^38
- Minimum positive normal value: 2^-126 ~= 1.175 x 10^-38
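These limits can be checked directly from the fp32 bit layout. The sketch below (plain Python with the standard `struct` module; it is an illustration, not part of the hardware description) reinterprets the extreme bit patterns as floats.

```python
import struct

def f32_from_bits(bits):
    # Reinterpret a 32-bit pattern as an IEEE-754 single-precision value.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Largest finite fp32: biased exponent 0xFE, mantissa all ones.
max_f32 = f32_from_bits(0x7F7FFFFF)
# Smallest positive normal fp32: biased exponent 0x01, mantissa zero.
min_norm_f32 = f32_from_bits(0x00800000)

print(max_f32)       # (2 - 2**-23) * 2**127 ~= 3.403e+38
print(min_norm_f32)  # 2**-126 ~= 1.175e-38
```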
bfloat16 numbers keep this
8-bit exponent, but the mantissa is reduced to 7 bits with an implicit leading 1. An
fp32 value is translated into the sum of
3 bfloat16 values. Unfortunately, this translation
is not exact for some extremely low values, because the exponent of the bfloat16 cannot be low enough.
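One way to see why three bfloat16 terms suffice is that a bfloat16 is simply the top 16 bits of an fp32, so three rounds of truncation and subtraction capture all 24 significand bits (3 x 8). The sketch below is an illustrative decomposition assuming plain truncation; it is not the hardware's actual algorithm.

```python
import struct

def to_f32(x):
    # Round a Python float (double) to the nearest fp32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    # Truncate an fp32 value to bfloat16 by keeping its top 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split3(x):
    # Express an fp32 value as the sum of three bfloat16 values:
    # each round peels off the next 8 significand bits.
    b0 = to_bf16(x)
    b1 = to_bf16(x - b0)
    b2 = to_bf16(x - b0 - b1)
    return b0, b1, b2

x = to_f32(3.14159265)
b0, b1, b2 = split3(x)
print(b0 + b1 + b2 == x)  # True: the decomposition is exact here
```

As the text notes, this breaks down for extremely small inputs: when a residual's exponent falls below the normal bfloat16 range, it can no longer be represented and the sum is only approximate.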
Based on the precision of the floating-point operation, the following table specifies the ULP and the corresponding code that is executed.
These ULPs are given for X and Y values such that X, Y, and X*Y have
an exponent in the range [-102, +126], which is equivalent to an fp32 magnitude in
the range [1.97e-31 ; 1.70e+38].
| Precision | ULP Range or ULP Frequency | Executed Assembly Code |
|---|---|---|
| low | 6 to 11 | |
| fast | 0: 56.11%, 1: 37.68%, 2: 5.36%, 3: 0.83%, 4: 0.02% | |
| safe | 0: 99.11%, 1: 0.89%, 2: 5.8e-4% | |
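The ULP figures in the table can be reproduced by comparing bit patterns: for finite fp32 values of the same sign, the ULP distance is the difference of their integer encodings. The helper below is a generic sketch (not tied to the hardware's accuracy modes) that could be used to histogram such errors against an exact reference.

```python
import struct

def f32_bits(x):
    # Integer encoding of an fp32 value (monotonic for finite,
    # same-sign values).
    return struct.unpack("<i", struct.pack("<f", x))[0]

def ulp_distance(a, b):
    # Number of representable fp32 values between a and b
    # (both finite, same sign assumed).
    return abs(f32_bits(a) - f32_bits(b))

# Smallest fp32 strictly greater than 1.0 is exactly 1 ULP away.
next_up = struct.unpack("<f", struct.pack("<I", 0x3F800001))[0]
print(ulp_distance(1.0, next_up))  # 1
```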
The higher the precision, the more operations must be executed to achieve it, which can reduce compute performance.