Floating-Point Accuracy - 2025.2 English - UG1603

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2025-11-26
Version: 2025.2 English

The accuracy flag (low | fast | safe) provides a tradeoff between performance and acceptable error in terms of ULP (Unit in the Last Place). The product of two bfloat16 operands is in single-precision floating-point (fp32) format. The bfloat16 vector multiplier is immediately followed by an fp32 adder in the vector processor execution pipeline to implement multiply-accumulate or multiply-add operations.

fp32 values have an 8-bit exponent and a 23-bit mantissa with an implicit leading 1 for normal numbers:

  • Maximum positive value: (2 − 2⁻²³) × 2¹²⁷ ≈ 3.403 × 10³⁸
  • Minimum positive normal value: 2⁻¹²⁶ ≈ 1.175 × 10⁻³⁸

bfloat16 numbers retain the 8-bit exponent, but the mantissa is reduced to 7 bits with an implicit leading 1. An fp32 value is emulated using the sum of three bfloat16 values. This representation may not be exact for values of very small magnitude, because the exponent of some of the bfloat16 terms can be too small to be representable as a normal value.
Note: Subnormal values are flushed to zero.

With X as an fp32 value and Z_0, Z_1, and Z_2 as bfloat16 values, then:

  • Z_0 = (bfloat16)X # cast X to a bfloat16 value
  • Z_1 = (bfloat16)(X - (float)Z_0) # cast the first residual to bfloat16
  • Z_2 = (bfloat16)(X - (float)Z_0 - (float)Z_1) # cast the second residual to bfloat16
  • X ≈ (float)Z_0 + (float)Z_1 + (float)Z_2
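The decomposition above can be reproduced in a short Python sketch by masking an fp32 bit pattern down to its top 16 bits. This is illustrative only: it uses truncation for the bfloat16 cast (hardware may round to nearest even), and subnormal handling is not modeled beyond what ordinary fp32 arithmetic provides.

```python
import struct

def f32(v: float) -> float:
    """Round a Python float to fp32 precision."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def to_bfloat16(v: float) -> float:
    """Truncate an fp32 value to bfloat16: keep the sign bit, the 8-bit
    exponent, and the top 7 mantissa bits (the high 16 bits overall).
    Hardware may round to nearest even instead; truncation keeps the
    sketch simple."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_fp32(x: float):
    """Decompose an fp32 value X into bfloat16 terms Z0, Z1, Z2
    following the casts listed in the text."""
    x = f32(x)
    z0 = to_bfloat16(x)
    z1 = to_bfloat16(f32(x - z0))
    z2 = to_bfloat16(f32(x - z0 - z1))
    return z0, z1, z2

z0, z1, z2 = split_fp32(3.14159265)
print(z0, z1, z2, z0 + z1 + z2)
```

With truncation, the three 8-bit significands cover disjoint bit ranges of the fp32 24-bit significand, so for normal values in range the sum Z_0 + Z_1 + Z_2 reconstructs X exactly.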

Based on the selected accuracy of the floating-point operation, the following table shows the ULP error when emulating fp32 multiplication and the corresponding code that is executed.

These ULP figures hold for X and Y values such that X, Y, and X*Y all have an exponent in the range [-102, +126], which corresponds to an fp32 magnitude in the range [1.97e-31, 1.70e+38].

Table 1. Accuracy, ULP, and Corresponding Code
Accuracy Flag: low
ULP range: 6 to 11
Executed assembly code:

  VMUL.f bmh2, x4, x6, r1
  VMAC.f bml3, bmh2, x7, x3, r1
  VMAC.f bmh3, bml3, x7, x6, r1

Accuracy Flag: fast
ULP frequency: 0: 56.11% | 1: 37.68% | 2: 5.36% | 3: 0.83% | 4: 0.02%
Executed assembly code:

  VMUL.f bmh3, x7, x5, r1
  VMAC.f bml4, bmh3, x3, x2, r1
  VMAC.f bmh4, bml4, x8, x4, r1
  VMAC.f bml5, bmh4, x3, x8, r1
  VMAC.f bmh5, bml5, x7, x2, r1
  VMAC.f bml6, bmh5, x7, x8, r1

Accuracy Flag: safe
ULP frequency: 0: 99.11% | 1: 0.89% | 2: 5.8e-4%
Executed assembly code:

  VMUL.f bmh3, x4, x8, r1
  VMAC.f bml4, bmh3, x4, x2, r1
  VMAC.f bmh4, bml4, x3, x8, r1
  VMAC.f bml5, bmh4, x5, x8, r1
  VMAC.f bmh5, bml5, x3, x2, r1
  VMAC.f bml8, bmh5, x7, x4, r1
  VMAC.f bml6, bml8, x3, x7, r1
  VMAC.f bml7, bml6, x5, x2, r1
  VMAC.f bmh6, bml7, x5, x7, r1

Higher accuracy requires more operations, resulting in reduced throughput and longer latency.
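The instruction counts in Table 1 (1 VMUL plus 2, 5, or 8 VMACs) correspond to keeping 3, 6, or 9 of the partial products in the expansion (Z_0 + Z_1 + Z_2)(Y_0 + Y_1 + Y_2). The Python sketch below illustrates this tradeoff end to end; it is not the hardware implementation. The bfloat16 cast uses truncation rather than hardware rounding, and the per-flag partial-product selections are assumptions chosen only to match the instruction counts in Table 1, so the measured ULP errors will not match the table exactly.

```python
import struct

def f32(v):
    """Round a Python float to fp32 precision."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def bf16(v):
    """Truncate fp32 to bfloat16 (hardware may round to nearest even)."""
    b = struct.unpack("<I", struct.pack("<f", v))[0]
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFF0000))[0]

def split(x):
    """fp32 -> (Z0, Z1, Z2) bfloat16 decomposition from the text."""
    x = f32(x)
    z0 = bf16(x)
    z1 = bf16(f32(x - z0))
    z2 = bf16(f32(x - z0 - z1))
    return z0, z1, z2

# Assumed partial-product selection per flag, largest terms first.
# Only the counts (3, 6, 9) are taken from Table 1; the exact subsets
# and accumulation order used by the hardware are not documented here.
ORDERS = {
    "low":  [(0, 0), (0, 1), (1, 0)],                          # 3 products
    "fast": [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0)],  # 6 products
    "safe": [(i, j) for i in range(3) for j in range(3)],      # 9 products
}

def emulated_mul(x, y, flag):
    """Emulate fp32 x*y by accumulating bfloat16 partial products
    through an fp32 multiply-accumulate chain."""
    xs, ys = split(x), split(y)
    acc = 0.0
    for i, j in ORDERS[flag]:
        acc = f32(acc + f32(xs[i] * ys[j]))  # one fp32 MAC per product
    return acc

def ulp_distance(a, b):
    """Count of representable fp32 values between a and b (same sign)."""
    bits = lambda v: struct.unpack("<I", struct.pack("<f", v))[0]
    return abs(bits(f32(a)) - bits(f32(b)))

x, y = 1.2345678, 7.6543219
exact = f32(x * y)
for flag in ("low", "fast", "safe"):
    print(flag, ulp_distance(emulated_mul(x, y, flag), exact))
```

Running the sketch shows the error shrinking as more partial products are accumulated, mirroring the throughput/accuracy tradeoff described above: each extra product costs one more fp32 MAC in the pipeline.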