Floats and Doubles - 2020.2 English - UG1399

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2021-03-22
Version: 2020.2 English

Vitis HLS supports the float and double types for synthesis. Both data types are synthesized with partial IEEE-754 compliance (see the Floating-Point Operator LogiCORE IP Product Guide (PG060)).

  • Single-precision 32-bit
    • 24-bit fraction (23 stored bits plus the implicit leading bit)
    • 8-bit exponent
  • Double-precision 64-bit
    • 53-bit fraction (52 stored bits plus the implicit leading bit)
    • 11-bit exponent
Recommended: When using floating-point data types, Xilinx highly recommends that you review Floating-Point Design with Vivado HLS (XAPP599).
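
The fraction and exponent widths listed above can be checked on the host with the standard <float.h> macros. The following is a minimal sketch, assuming a host compiler whose float and double are IEEE-754 binary32 and binary64 (true of common x86 and Arm toolchains). The function names are hypothetical helpers for illustration and are not part of any Vitis HLS API.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Total storage width of each type, in bits. */
int float_total_bits(void)  { return (int)(sizeof(float)  * CHAR_BIT); }
int double_total_bits(void) { return (int)(sizeof(double) * CHAR_BIT); }

/* Exponent width = total bits - 1 sign bit - stored fraction bits.
 * FLT_MANT_DIG and DBL_MANT_DIG count the implicit leading bit, so the
 * stored fraction is one bit narrower than the macro value. */
int float_exponent_bits(void)
{
    return float_total_bits() - 1 - (FLT_MANT_DIG - 1);
}

int double_exponent_bits(void)
{
    return double_total_bits() - 1 - (DBL_MANT_DIG - 1);
}

/* Host-side sanity check (hypothetical helper): aborts if the host
 * formats do not match the widths listed in this section. */
void check_float_formats(void)
{
    assert(FLT_RADIX == 2);
    assert(float_total_bits() == 32);
    assert(FLT_MANT_DIG == 24);   /* 24-bit fraction, implicit bit included */
    assert(double_total_bits() == 64);
    assert(DBL_MANT_DIG == 53);   /* 53-bit fraction, implicit bit included */
}
```

Calling check_float_formats() at the start of a C test bench is one way to confirm that the host simulation uses the same formats as the synthesized operators.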

In addition to using floats and doubles for standard arithmetic operations (such as +, -, *), floats and doubles are commonly used with the math.h functions (and cmath for C++). This section discusses support for standard operators.

The following code example shows the header file from the Standard Types example, updated to define the data types as double and float.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define N 9

typedef double din_A;
typedef double din_B;
typedef double din_C;
typedef float din_D;

typedef double dout_1;
typedef double dout_2;
typedef double dout_3;
typedef float dout_4;

void types_float_double(din_A inA, din_B inB, din_C inC, din_D inD,
                        dout_1 *out1, dout_2 *out2, dout_3 *out3, dout_4 *out4);

This updated header file is used with the following code example, which calls the sqrtf() function.

#include "types_float_double.h"

void types_float_double(
 din_A  inA,
 din_B  inB,
 din_C  inC,
 din_D  inD,
 dout_1 *out1,
 dout_2 *out2,
 dout_3 *out3,
 dout_4 *out4
 ) {

 // Basic arithmetic & math.h sqrtf() 
 *out1 = inA * inB;
 *out2 = inB + inA;
 *out3 = inC / inA;
 *out4 = sqrtf(inD);

}

When the example above is synthesized, it results in 64-bit double-precision multiplier, adder, and divider operators. These operators are implemented by the appropriate floating-point Xilinx IP catalog cores.

The square-root function used, sqrtf(), is implemented using a 32-bit single-precision floating-point core.

If the double-precision square-root function sqrt() were used instead, it would result in additional logic to cast to and from the 32-bit single-precision float types used for inD and out4: sqrt() is a double-precision (double) function, while sqrtf() is a single-precision (float) function.

In C functions, be careful when mixing float and double types, because float-to-double and double-to-float conversion units are inferred in the hardware. For example:

float foo_f = 3.1459;
float var_f = sqrt(foo_f);

The above code results in the following hardware:

wire (foo_f)
-> Float-to-Double Converter unit
-> Double-Precision Square Root unit
-> Double-to-Float Converter unit
-> wire (var_f)

Using a sqrtf() function:

  • Removes the need for the type converters in hardware
  • Saves area
  • Improves timing

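When synthesizing float and double types, Vitis HLS maintains the order of operations performed in the C code to ensure that the results are the same as the C simulation. Because of saturation and truncation (each intermediate result is rounded to the target precision and can overflow or underflow), floating-point arithmetic is not associative. The two sequences below multiply the same four operands in a different order, and their results are not guaranteed to be the same in single- or double-precision operations: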

       A=B*C;    A=B*F;
       D=E*F;    D=E*C;
       O1=A*D;   O2=A*D;

With float and double types, O1 and O2 are not guaranteed to be the same.
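
The two orderings above can be compared on the host. The following is a minimal sketch with hypothetical function names, assuming IEEE-754 single-precision arithmetic on the host. The intermediates are declared volatile so that each one is rounded to float when it is stored, as a 32-bit operator would round it.

```c
#include <assert.h>

/* O1 = (B*C) * (E*F) */
float product_order_1(float b, float c, float e, float f)
{
    volatile float a = b * c;   /* rounded (or saturated) to float here */
    volatile float d = e * f;
    return a * d;
}

/* O2 = (B*F) * (E*C): same operands, different pairing. */
float product_order_2(float b, float c, float e, float f)
{
    volatile float a = b * f;
    volatile float d = e * c;
    return a * d;
}
```

With B = C = 1e30f and E = F = 1e-30f, the first ordering overflows to infinity in B*C and underflows to zero in E*F, so O1 is infinity times zero, which is NaN. The second ordering keeps both intermediates close to 1.0, so O2 is approximately 1.0.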

Tip: In some design-dependent cases, optimizations such as unrolling or partial unrolling of loops might not be able to take full advantage of parallel computation, because Vitis HLS maintains the strict order of operations when synthesizing float and double types. You can override this restriction with config_compile -unsafe_math_optimizations.
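
To see why relaxing the order of operations is classed as unsafe, the sketch below compares a strictly ordered sum with a two-lane sum of the same values. The two-lane version is only an assumption about the kind of reassociation that parallel accumulation requires, not a description of what the tool actually generates, and the names are hypothetical. The accumulators are volatile so that every partial sum is rounded to float.

```c
#include <assert.h>

#define LEN 4

/* Strict source order: ((x0 + x1) + x2) + x3 */
float sum_in_order(const float x[LEN])
{
    volatile float sum = 0.0f;
    for (int i = 0; i < LEN; i++) {
        sum = sum + x[i];
    }
    return sum;
}

/* Reassociated order: (x0 + x2) + (x1 + x3). The two lanes are
 * independent, so they could be computed in parallel. */
float sum_two_lanes(const float x[LEN])
{
    volatile float even = 0.0f;
    volatile float odd  = 0.0f;
    for (int i = 0; i < LEN; i += 2) {
        even = even + x[i];
        odd  = odd  + x[i + 1];
    }
    return even + odd;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, the strictly ordered sum is 1.0 because the first 1.0 is lost when it is added to 1e8 (the spacing between adjacent floats at that magnitude is 8). The two-lane sum is 2.0 because the large values cancel before the small ones are added.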

For C++ designs, Vitis HLS provides a bit-approximate implementation of the most commonly used math functions.

Floating-Point Accumulator and MAC

Accumulation for floats and doubles is supported with an initiation interval (II) of 1 on all devices. This means that the following code can be pipelined with an II of 1 without any additional coding:
float foo(float A[10], float B[10]) {
    float sum = 0.0;
    for (int i = 0; i < 10; i++) {
        sum += A[i] * B[i];
    }
    return sum;
}
Note: On Versal devices, the multiplication and accumulation (MAC) above is implemented with a single floating-point MAC operator. On other devices, a multiplier and a single- or double-precision accumulator are inferred. You can globally disable the use of the floating-point accumulator and MAC operators with the following Tcl commands, either in a Tcl script in your project or from the Vitis HLS command line:
# Enable or disable floating-point accumulator inference (true by default)
::common::set_param hls.enable_float_acc_inference false
# Enable or disable floating-point MAC inference on Versal devices (true by default)
::common::set_param hls.enable_float_mul_acc_inference false