IEEE 754 Format Trick - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

In addition to basic arithmetic operations, softmax computation depends on efficient evaluation of the exponential function. There are several ways to accomplish this; an attractive option is to estimate the exponential function using a trick based on the IEEE 754 floating-point format [2]. The layout of a double-precision floating-point number in IEEE 754 format is shown in the following figure.

Figure 3 - IEEE 754 Format for Double-Precision Numbers

This format represents the number \((-1)^s(1+m)2^{x-x_0}\), where \(s\) is the sign bit, \(m\) is the 52-bit fractional part of the normalized mantissa, and \(x\) is the 11-bit exponent with bias \(x_0 = 1023\).
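The field layout described above can be checked with a short sketch. This is only an illustration in Python, not part of the tutorial's AI Engine code; the helper names `decode_double` and `value_from_fields` are hypothetical and chosen here for clarity.

```python
import struct

def decode_double(value):
    """Split a Python float (an IEEE 754 double) into the fields s, x, m."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    s = bits >> 63                            # 1-bit sign
    x = (bits >> 52) & 0x7FF                  # 11-bit biased exponent
    m = (bits & ((1 << 52) - 1)) / 2.0 ** 52  # 52-bit fraction scaled to [0, 1)
    return s, x, m

def value_from_fields(s, x, m, x0=1023):
    """Evaluate (-1)^s (1 + m) 2^(x - x0) for a normalized number."""
    return (-1.0) ** s * (1.0 + m) * 2.0 ** (x - x0)

s, x, m = decode_double(-6.5)
print(s, x, m)                     # 1 1025 0.625
print(value_from_fields(s, x, m))  # -6.5
```

Here \(-6.5 = -(1 + 0.625) \cdot 2^{2}\), so the stored exponent is \(x = 2 + 1023 = 1025\). The sketch only handles normalized numbers; zeros, subnormals, infinities, and NaNs use reserved exponent values and are not covered.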

The approximation is based on the identity \(e^y = 2^{y/\log(2)}\), where \(\log\) denotes the natural logarithm. For any floating-point number \(y\), the value \(e^y\) is therefore approximated by setting the exponent \(x\) of the result to \(y/\log(2) + x_0\). To perform the computation, it helps to split the double-precision number into two groups: the upper 32 bits and the lower 32 bits. The lower 32 bits are set to 0 in this approximation, while the upper 32 bits are the same bits used to represent the signed 32-bit integer value

$$ I_{\text{upper}} = \left\lfloor \frac{2^{20}y}{\log(2)} + 2^{20}x_0 - C \right\rfloor . $$

The factor of \(2^{20}\) represents the binary shift needed to align with the exponent field of the IEEE 754 format: the upper 32 bits hold the sign bit, the 11-bit exponent, and the 20 most significant mantissa bits. The residual mantissa bits provide a degree of interpolation between exponent values. The parameter \(C\) is a correction factor meant to mitigate estimation error; a value of \(C = 60801\) was found to minimize RMS error [2]. This estimation method can be adapted to other floating-point representations, such as 32-bit single precision.
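The estimate can be tried out with a minimal sketch. It is written in Python for illustration only and is not the tutorial's AI Engine kernel; the function name `exp_approx` and the constant names are assumptions made here. The sketch computes the signed 32-bit upper word from the formula above, places it in front of a zero lower word, and reinterprets the 64 bits as a double.

```python
import math
import struct

X0 = 1023        # exponent bias of IEEE 754 double precision
SHIFT = 2 ** 20  # aligns the scaled input with the exponent field of the upper word
C = 60801        # correction constant quoted in the text from [2]

def exp_approx(y):
    """Approximate e**y by writing only the upper 32 bits of a double."""
    # Signed 32-bit integer value for the upper word.
    i_upper = math.floor(SHIFT * y / math.log(2) + SHIFT * X0 - C)
    # Pack the upper word (signed) followed by a zero lower word, big-endian,
    # then reinterpret the 8 bytes as a double-precision number.
    return struct.unpack(">d", struct.pack(">iI", i_upper, 0))[0]

for y in (-2.0, 0.0, 1.0, 3.5):
    approx = exp_approx(y)
    exact = math.exp(y)
    print(f"y={y:5.1f}  approx={approx:.6f}  exact={exact:.6f}  "
          f"rel_err={approx / exact - 1.0:+.4f}")
```

The result is a coarse estimate whose relative error is on the order of a few percent, which is the error that the correction factor \(C\) is meant to balance. The sketch assumes \(y\) stays in a range where the upper word fits in a signed 32-bit integer and the resulting exponent is valid; `struct.pack` raises an error for values outside the 32-bit range.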