The arbitrary precision floating-point library gives designers the flexibility to investigate which precision best fits their design, and to trade resource cost and performance against accuracy relative to the standard floating-point formats.
The library can also represent other floating-point formats such as bfloat16 and tf32.
The required header file is ap_float.h, which must be added to the source code with #include <ap_float.h>. The library is provided as a template class ap_float<int W, int E>, where W is the total bit width of the variable and E is the bit width of the exponent. The remaining fields follow the IEEE 754 floating-point standard: a 1-bit sign, and W-E-1 bits that hold the fraction bits of the significand. The significand has W-E bits: an implicit leading bit set to "1" followed by the W-E-1 fraction bits.
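The field layout described above can be checked with a few lines of plain C++. The following is a minimal sketch that does not use ap_float.h: float_layout is a hypothetical helper introduced here for illustration, and the exponent bias formula is an assumption based on the IEEE 754 convention rather than something stated for ap_float<>.

```cpp
// Hypothetical helper (not part of ap_float.h): field widths of an
// IEEE 754 style format with total width W and exponent width E.
template <int W, int E>
struct float_layout {
    static_assert(E > 0 && W > E + 1,
                  "need a sign bit, E exponent bits and at least one fraction bit");
    static constexpr int sign_bits        = 1;
    static constexpr int exponent_bits    = E;
    static constexpr int fraction_bits    = W - E - 1;  // bits actually stored
    static constexpr int significand_bits = W - E;      // implicit leading 1 + fraction
    // Assumption: IEEE 754 style exponent bias, 2^(E-1) - 1.
    static constexpr int exponent_bias    = (1 << (E - 1)) - 1;
};

// Compile-time checks against well-known formats.
static_assert(float_layout<32, 8>::fraction_bits == 23, "float stores 23 fraction bits");
static_assert(float_layout<32, 8>::exponent_bias == 127, "float bias is 127");
static_assert(float_layout<18, 6>::significand_bits == 12, "18-bit format with 6 exponent bits");
```

For a custom format such as W=18, E=6, this gives 11 stored fraction bits and a 12-bit significand.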
With the ap_float<> library, the standard 32-bit IEEE 754 floating-point type corresponds to ap_float<32,8>; other common types are given in the following table.
C++ data type | ap_float<W,E> equivalent |
---|---|
float | ap_float<32,8> |
double | ap_float<64,11> |
half | ap_float<16,5> |
bfloat16 | ap_float<16,8> |
Table: Correspondence between common C++ floating-point data types and ap_float<>
The following restrictions are inherited from the AMD Floating-Point LogiCORE IP (see PG060), which the ap_float<> library uses when the RTL is generated:
- Total bitwidth W can be up to 80 bits.
- Exponent bitwidth E from 4 to 16 bits.
- Mantissa bitwidth from 4 to 64 bits.
- Sub-normals are not supported: sub-normal results and arguments are rounded to 0.
- Round to Nearest ties to Even is the default and only supported rounding mode.
- All NaNs are Quiet NaNs (Signaling NaNs are not supported).
Since sub-normals are not supported, the significand always has its implicit bit set to 1.
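Two of the restrictions above can be sketched in plain C++ without ap_float.h. Both helpers below are hypothetical. The parameter check assumes that "mantissa bitwidth" means the W-E significand bits (implicit bit included), and the flush function assumes that the sign of a flushed value is kept; neither assumption is confirmed by the text above.

```cpp
#include <cmath>

// Hypothetical compile-time check mirroring the listed limits.
// Assumption: the mantissa width is W - E (implicit bit included).
constexpr bool ap_float_params_ok(int W, int E) {
    return W <= 80 && E >= 4 && E <= 16 && (W - E) >= 4 && (W - E) <= 64;
}

static_assert(ap_float_params_ok(32, 8), "float-like format is within the limits");
static_assert(ap_float_params_ok(16, 8), "bfloat16-like format is within the limits");
static_assert(!ap_float_params_ok(8, 3), "a 3-bit exponent is too narrow");

// Emulates "sub-normals are rounded to 0" on a native float.
// Assumption: the sign of the input is preserved on the zero result.
inline float flush_subnormal_to_zero(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}
```

A check like ap_float_params_ok could be used in a static_assert next to a typedef to catch an out-of-range format early in a host-side build.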
Supported Devices
All Versal and UltraScale+ FPGA Devices are supported.
Supported Functions
All common C++ floating-point arithmetic operators are supported:
- binary +,-,*,/,
- compound assignment +=,-=,*=,/=,
- comparators,
- specific HLS functions like hls::sqrt, hls::fma, hls::abs
- accumulation is supported using operator+= or by using the specific ap_float_acc<> helper library discussed below.
There is no automatic conversion from ap_float<> to float, so calling any other function with an ap_float<> argument causes a compilation error. This is a design choice and is discussed below.
A simple example is shown:
#include <ap_float.h>
#include <cmath>  // M_PI

typedef ap_float<18, 6> my_float;

float apf_simple_example(float f_input) {
#pragma HLS pipeline
    my_float pi = M_PI;
    // Implicit conversion from float to ap_float<> is allowed
    my_float r = f_input;
    my_float area = pi * r * r;
    if (area > 1.0f) { area = 1.0f; }
    // Explicit conversion from ap_float<> to float is required
    // Otherwise => error: no viable conversion from 'ap_float<18, 6>' to 'float'
    float ret_value = (float)area;
    return ret_value;
}
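The conversion rules used in the example above (implicit conversion into the type, explicit conversion back to float) can be reproduced in standard C++ with an explicit conversion operator. The following is a minimal sketch of that pattern only: narrow_float and clamp_to_one are hypothetical names, the struct simply wraps a float, and nothing here reflects how ap_float<> is actually implemented.

```cpp
#include <type_traits>

// Hypothetical stand-in for a reduced-precision type; it stores a plain float.
struct narrow_float {
    float value;
    narrow_float(float f) : value(f) {}                 // implicit: float -> narrow_float
    explicit operator float() const { return value; }   // explicit only: narrow_float -> float
};

// The compiler accepts the implicit direction and rejects the reverse one.
static_assert(std::is_convertible<float, narrow_float>::value,
              "float converts implicitly to narrow_float");
static_assert(!std::is_convertible<narrow_float, float>::value,
              "narrow_float does not convert implicitly to float");
static_assert(std::is_constructible<float, narrow_float>::value,
              "an explicit cast to float is still possible");

inline float clamp_to_one(float in) {
    narrow_float x = in;                                 // implicit conversion in
    if (static_cast<float>(x) > 1.0f) { x = 1.0f; }
    return static_cast<float>(x);                        // explicit conversion out
}
```

With this pattern, a statement such as float y = x; fails to compile for a narrow_float x, which makes every widening back to float visible in the source.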