Arbitrary Precision Floating-Point Library ap_float<int W, int E>

Vitis High-Level Synthesis User Guide (UG1399)

Document ID
UG1399
Release Date
2024-11-13
Version
2024.2 English

The arbitrary precision floating-point library gives designers the flexibility to investigate which precision best fits their design, trading off resource cost and performance against accuracy relative to the standard floating-point formats.

The library can also represent other floating-point formats, such as bfloat16 and tf32.

To use the library, include the ap_float.h header in the source code with #include <ap_float.h>. The library is provided as the template ap_float<int W, int E>, where W is the total bit width of the variable and E is the bit width of the exponent. The remaining fields follow the IEEE 754 floating-point standard: a 1-bit sign and W-E-1 bits that hold the fraction of the significand. The significand therefore has W-E bits: an implicit leading bit set to "1" followed by the W-E-1 fraction bits.
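The field widths described above can be sketched in plain C++, without the HLS headers. This is only an illustration: FloatLayout is a hypothetical helper name, not part of ap_float.h, and it models the bit budget only, not the arithmetic.

```cpp
// Hypothetical helper (not part of ap_float.h): field widths implied by
// a format with total width W and exponent width E.
template <int W, int E>
struct FloatLayout {
    static_assert(W > E + 1, "at least one fraction bit is required");
    static constexpr int sign_bits        = 1;
    static constexpr int exponent_bits    = E;
    static constexpr int fraction_bits    = W - E - 1;  // bits actually stored
    static constexpr int significand_bits = W - E;      // implicit 1 + fraction
};

// The three stored fields add up to the total width.
static_assert(FloatLayout<32, 8>::sign_bits
            + FloatLayout<32, 8>::exponent_bits
            + FloatLayout<32, 8>::fraction_bits == 32, "fields must fill W bits");

// W = 32, E = 8 gives the 23 fraction bits of IEEE 754 single precision.
static_assert(FloatLayout<32, 8>::fraction_bits == 23, "single precision");
static_assert(FloatLayout<32, 8>::significand_bits == 24, "single precision");
```

For a custom format such as W = 18, E = 6, the same arithmetic gives 11 fraction bits and a 12-bit significand.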

With the ap_float<> library, the standard 32-bit IEEE 754 floating-point type corresponds to ap_float<32,8>; other common types are listed in the following table.

C++ data type    ap_float<W,E> equivalent
float            ap_float<32,8>
double           ap_float<64,11>
half             ap_float<16,5>
bfloat16         ap_float<16,8>

Table of correspondence for C++ common floating-point data types with ap_float<>

Note: Existing designs do not need to change, because the standard C++ float and double floating-point data types are supported natively by the compiler.

The following restrictions are inherited from the AMD Floating-Point LogiCORE IP (see PG060), which the ap_float<> library uses when the RTL is generated:

  1. Total bitwidth W can be up to 80 bits.
  2. Exponent bitwidth E from 4 to 16 bits.
  3. Mantissa bitwidth from 4 to 64 bits.
  4. Sub-normals are not supported: sub-normal results and arguments are rounded to 0.
  5. Round to Nearest ties to Even is the default and only supported rounding mode.
  6. All NaNs are Quiet NaNs (Signaling NaNs are not supported).
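
The numeric limits in restrictions 1 to 3 can be expressed as a compile-time check, which may help when exploring candidate formats. This is a sketch under stated assumptions: is_supported_format is a hypothetical helper, not part of ap_float.h, and it assumes that "mantissa bitwidth" refers to the W-E significand bits (implicit bit included), which the list above does not state explicitly.

```cpp
// Hypothetical helper (not part of ap_float.h).
// Assumption: "mantissa bitwidth" means the W - E significand bits,
// including the implicit leading bit.
constexpr bool is_supported_format(int W, int E) {
    return W <= 80                          // restriction 1: total width
        && E >= 4 && E <= 16                // restriction 2: exponent width
        && (W - E) >= 4 && (W - E) <= 64;   // restriction 3: mantissa width
}

// The formats from the correspondence table all pass.
static_assert(is_supported_format(32, 8),  "float");
static_assert(is_supported_format(64, 11), "double");
static_assert(is_supported_format(16, 5),  "half");
static_assert(is_supported_format(16, 8),  "bfloat16");

// Formats outside the limits are rejected.
static_assert(!is_supported_format(8, 2),   "exponent narrower than 4 bits");
static_assert(!is_supported_format(96, 15), "total width above 80 bits");
```

The check covers only the width limits. The restrictions on sub-normals, rounding mode, and NaNs describe run-time behavior and are not modeled here.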

Since sub-normals are not supported, the significand always has its implicit bit set to 1.
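
The flush-to-zero behavior for sub-normals can be modeled on the built-in float type to see its effect in software. This is a rough sketch, not the library's implementation: flush_subnormal_to_zero is a hypothetical name, and keeping the sign of the input on the flushed zero is an assumption that the text above does not confirm.

```cpp
#include <cmath>

// Hypothetical model of the sub-normal restriction, using float.
// Assumption: the flushed zero keeps the sign of the input.
inline float flush_subnormal_to_zero(float x) {
    if (std::fpclassify(x) == FP_SUBNORMAL) {
        return std::copysign(0.0f, x);  // sub-normal input becomes zero
    }
    return x;  // normal values, zeros, infinities and NaNs pass through
}
```

Applying such a function to the inputs and outputs of a float reference model is one way to compare it against a design that never produces sub-normal values.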

Supported Devices

All Versal and UltraScale+ FPGA Devices are supported.

Supported Functions

All common C++ floating-point arithmetic operators are supported:

  • binary +,-,*,/,
  • compound assignment +=,-=,*=,/=,
  • comparators,
  • specific HLS functions like hls::sqrt, hls::fma, hls::abs
  • accumulation is supported using operator+= or by using the specific ap_float_acc<> helper library discussed below.

There is no automatic conversion from ap_float<> to float, so calling functions other than those listed above with ap_float<> arguments causes a compilation error. This is a design choice and is discussed below.

A simple example is shown:

#include <cmath>      // M_PI
#include <ap_float.h>

typedef ap_float<18, 6> my_float;

float apf_simple_example( float f_input) {
    #pragma HLS pipeline
    my_float pi   = M_PI;

    // Implicit conversion from float to ap_float<> is allowed
    my_float r    = f_input; 

    my_float area = pi*r*r;
    if ( area>1.0f ) { area=1.0f; }

    // Explicit conversion from ap_float<>
    // Otherwise => error: no viable conversion from 'ap_float<18, 6>' to 'float'
    float ret_value=(float)area; 

    return ret_value;
}
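
The conversion rule used in the example above (implicit from float, explicit back to float) can be reproduced in standard C++ with an implicit constructor and an explicit conversion operator. The sketch below is not the ap_float<> implementation: NarrowFloat and round_trip are hypothetical names, and the class simply wraps a float to show the compile-time behavior.

```cpp
#include <type_traits>

// Hypothetical stand-in type (NOT the ap_float<> implementation).
class NarrowFloat {
public:
    NarrowFloat(float v) : value_(v) {}                  // implicit: float -> NarrowFloat
    explicit operator float() const { return value_; }  // explicit only: NarrowFloat -> float
private:
    float value_;
};

// Conversion into the type is implicit, conversion out of it is not.
static_assert(std::is_convertible<float, NarrowFloat>::value,
              "float converts implicitly to NarrowFloat");
static_assert(!std::is_convertible<NarrowFloat, float>::value,
              "NarrowFloat does not convert implicitly to float");
static_assert(std::is_constructible<float, NarrowFloat>::value,
              "an explicit cast to float is allowed");

inline float round_trip(float f) {
    NarrowFloat n = f;             // implicit conversion compiles
    // float bad = n;              // would fail: no viable implicit conversion
    return static_cast<float>(n);  // explicit conversion compiles
}
```

With this pattern, any place where a value leaves the custom type is visible in the source as a cast.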