Adapting for bfloat16 Floating-Point - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2024-12-06
Version
2024.2 English

AMD Versal™ Edge Adaptive SoCs primarily contain a variant of the AI Engine processor that supports bfloat16 floating-point as a native data type. The structure of a bfloat16 number is shown in the following figure.

Figure 4 - bfloat16 Floating-Point Format

This format is structurally similar to the IEEE 754 single-precision format, but with reduced precision due to fewer bits being used to represent the mantissa. The number of exponent bits is the same as in single-precision, giving bfloat16 essentially the same dynamic range. This is attractive in deep learning and other applications where dynamic range is more important than precision. To adapt the exponential function approximation to the bfloat16 data type, the equation becomes

$$ I = \left\lfloor \frac{2^{7}}{\log(2)} \left( y - \log(2)\, F(y_f) \right) + 2^{7}x_0 \right\rfloor $$

where $x_0 = 127$ and $I$ is computed as a signed 16-bit integer, which is then reinterpreted as bfloat16.
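The reinterpretation step can be illustrated with a minimal scalar sketch in standard C++, assuming bfloat16 is modeled as the upper 16 bits of an IEEE 754 single-precision value. This only demonstrates the bit-level mechanism; it is not the AI Engine kernel code, which uses native bfloat16 vector types.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Scalar model of the "reinterpret as bfloat16" step: a bfloat16 value is
// the upper 16 bits of an IEEE 754 single-precision number, so the 16-bit
// integer I can be viewed as a float by shifting it into the high half word.
float bf16_bits_as_float(int16_t I) {
    uint32_t w = static_cast<uint32_t>(static_cast<uint16_t>(I)) << 16;
    float f;
    std::memcpy(&f, &w, sizeof f);   // type-pun without violating aliasing rules
    return f;
}

int main() {
    // Example: the bit pattern 0x3F80 (sign 0, exponent 127, mantissa 0)
    // reinterprets to 1.0f.
    std::printf("%f\n", bf16_bits_as_float(0x3F80));   // prints 1.000000
    return 0;
}
```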

The correction function $F(y_f)$ may be approximated with a polynomial. However, the quantization introduced by the bfloat16 arithmetic used to evaluate the polynomial, and by the remaining softmax computation, counteracts the benefit of using a correction function. In addition, bfloat16 numbers greater than 128.0 have no fractional part to use as an argument to the correction polynomial. When evaluating the exponential function with bfloat16 data types, this unnecessary computation can therefore be avoided by using the simpler estimate

$$ I = \left\lfloor \frac{2^{7}}{\log(2)} y + 2^{7}x_0 \right\rfloor $$

which becomes

$$ I = \left\lfloor 185 y + 16256 \right\rfloor $$

after taking quantization of coefficients into account. For an example of using a correction function in exponential function estimation, consult the Softmax Function Tutorial based on single-precision floating-point computation.
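As a quick sanity check of the simplified estimate, the following scalar sketch in plain C++ evaluates $I = \lfloor 185y + 16256 \rfloor$ and reinterprets the result as bfloat16 (again modeled as the upper 16 bits of a float32). The function name and the saturation limits are illustrative choices, not the tutorial's AI Engine kernel.

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>
#include <cstdio>

// Scalar sketch of the simplified bfloat16 exponential estimate
// I = floor(185*y + 16256), with I then reinterpreted as a bfloat16 value.
float exp_bf16_estimate(float y) {
    int32_t I = static_cast<int32_t>(std::floor(185.0f * y + 16256.0f));
    if (I < 0)      I = 0;        // underflow: clamp to the +0.0 bit pattern
    if (I > 0x7F7F) I = 0x7F7F;   // saturate at the largest finite bfloat16 pattern
    uint32_t w = static_cast<uint32_t>(I) << 16;
    float f;
    std::memcpy(&f, &w, sizeof f);
    return f;
}

int main() {
    const float samples[] = {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f};
    for (float y : samples)
        std::printf("y=%5.1f  estimate=%9.4f  exp=%9.4f\n",
                    y, exp_bf16_estimate(y), std::exp(y));
    return 0;
}
```

At $y = 0$ the estimate returns exactly 1.0, since $I = 16256 = \mathrm{0x3F80}$ is the bfloat16 bit pattern for 1.0; for other arguments the estimate overshoots slightly because the correction term has been dropped.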