Arbitrary Precision Fixed-Point Data Types

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2022-05-25
Version: 2022.1 English

Some existing applications use floating-point data types because they were written for other hardware architectures. However, floating-point operations require many clock cycles to complete, and fixed-point data types are a useful replacement. When choosing between floating-point and fixed-point arithmetic for your application and accelerators, carefully evaluate the trade-offs in power, cost, productivity, and precision.

As discussed in Reduce Power and Cost by Converting from Floating Point to Fixed Point (WP491), using fixed-point arithmetic instead of floating-point can increase power efficiency and lower the total power required. Unless the entire range of the floating-point type is required, the same accuracy can often be achieved with a fixed-point type, using smaller and faster hardware.

Fixed-point data types model data as integer and fraction bits. Using them requires the ap_fixed.h header file. Both signed and unsigned forms are supported, as follows:

Header file: ap_fixed.h
Signed fixed point: ap_fixed<W,I,Q,O,N>
Unsigned fixed point: ap_ufixed<W,I,Q,O,N>

W = Total width < 1024 bits
I = Integer bit width. The value of I must be less than or equal to the width (W). The number of bits to represent the fractional part is W minus I. Only a constant integer expression can be used to specify the integer width.
Q = Quantization mode. Only predefined enumerated values can be used to specify Q. The accepted values are:
- AP_RND: Rounding to plus infinity.
- AP_RND_ZERO: Rounding to zero.
- AP_RND_MIN_INF: Rounding to minus infinity.
- AP_RND_INF: Rounding to infinity.
- AP_RND_CONV: Convergent rounding.
- AP_TRN: Truncation. This is the default value when Q is not specified.
- AP_TRN_ZERO: Truncation to zero.
O = Overflow mode. Only predefined enumerated values can be used to specify O. The accepted values are:
- AP_SAT: Saturation.
- AP_SAT_ZERO: Saturation to zero.
- AP_SAT_SYM: Symmetrical saturation.
- AP_WRAP: Wrap-around. This is the default value when O is not specified.
- AP_WRAP_SM: Sign magnitude wrap-around.
N = The number of saturation bits in the overflow WRAP modes. Only a constant integer expression can be used as the parameter value. The default value is zero.
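
The effect of the quantization and overflow modes can be sketched with plain integers. The block below is an illustration only and does not use ap_fixed.h: the helper names (scaled, quantize_trn, quantize_rnd, quantize_trn_zero, overflow_wrap, overflow_sat) are hypothetical, and it assumes that AP_TRN truncates toward minus infinity and that AP_RND resolves ties toward plus infinity. A value is held as a raw integer count of 2^-frac_bits steps.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers that emulate the documented modes with plain
// integers. They are not part of ap_fixed.h.

// Scale a real value so that one unit equals one step of 2^-frac_bits.
double scaled(double value, int frac_bits) {
    assert(frac_bits >= 0 && frac_bits < 32);
    return std::ldexp(value, frac_bits);
}

// AP_TRN-like: drop the extra bits, which rounds toward minus infinity.
int64_t quantize_trn(double value, int frac_bits) {
    return static_cast<int64_t>(std::floor(scaled(value, frac_bits)));
}

// AP_RND-like: round to nearest, ties toward plus infinity (assumption).
int64_t quantize_rnd(double value, int frac_bits) {
    return static_cast<int64_t>(std::floor(scaled(value, frac_bits) + 0.5));
}

// AP_TRN_ZERO-like: drop the extra bits toward zero.
int64_t quantize_trn_zero(double value, int frac_bits) {
    return static_cast<int64_t>(std::trunc(scaled(value, frac_bits)));
}

// AP_WRAP-like: keep the low `width` bits, read back as two's complement.
int64_t overflow_wrap(int64_t raw, int width) {
    assert(width >= 1 && width < 63);
    const int64_t modulus = int64_t{1} << width;
    int64_t low = raw % modulus;             // may be negative in C++
    if (low < 0) low += modulus;             // now in [0, 2^width)
    if (low >= modulus / 2) low -= modulus;  // map to the signed range
    return low;
}

// AP_SAT-like: clamp to the signed range of a `width`-bit value.
int64_t overflow_sat(int64_t raw, int width) {
    assert(width >= 1 && width < 63);
    const int64_t max_raw = (int64_t{1} << (width - 1)) - 1;
    const int64_t min_raw = -(int64_t{1} << (width - 1));
    if (raw > max_raw) return max_raw;
    if (raw < min_raw) return min_raw;
    return raw;
}
```

With one fractional bit, quantize_trn(1.25, 1) gives raw 2 (the value 1.0), while quantize_rnd(1.25, 1) gives raw 3 (the value 1.5). For a 4-bit signed raw value, overflow_wrap(9, 4) gives -7, while overflow_sat(9, 4) gives 7.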

In the example code below, the ap_fixed type is used to define a signed 18-bit variable with 6 bits representing the integer value above the binary point, and by implication, 12 bits representing the fractional value below the binary point. The quantization mode is set to round to plus infinity (AP_RND). Because the overflow mode and saturation bits are not specified, the defaults AP_WRAP and 0 are used.

#include <ap_fixed.h>
...
  ap_fixed<18,6,AP_RND> my_type;
...
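
To make the 18-bit, 6-integer-bit format concrete, the following sketch computes the resolution and representable range implied by W and I for a signed type. It uses only standard C++ rather than ap_fixed.h; the FixedFormat struct and the three functions are hypothetical names introduced for illustration, and the sketch assumes the sign bit is counted within I.

```cpp
#include <cmath>

// Hypothetical description of a signed fixed-point format.
// Assumes the sign bit is one of the int_bits integer bits.
struct FixedFormat {
    int width;     // W: total number of bits
    int int_bits;  // I: bits above the binary point, including the sign
};

// Smallest step between adjacent values: 2^-(W - I).
double resolution(FixedFormat f) {
    return std::ldexp(1.0, -(f.width - f.int_bits));
}

// Most negative representable value: -2^(I - 1).
double min_value(FixedFormat f) {
    return -std::ldexp(1.0, f.int_bits - 1);
}

// Most positive representable value: 2^(I - 1) minus one step.
double max_value(FixedFormat f) {
    return std::ldexp(1.0, f.int_bits - 1) - resolution(f);
}
```

For FixedFormat{18, 6}, this gives a resolution of 2^-12 (about 0.000244), a minimum of -32, and a maximum of 32 - 2^-12.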

When a calculation involves variables with different numbers of bits (W) or different precision (I), the binary point is automatically aligned. For more information on using fixed-point data types, see C++ Arbitrary Precision Fixed-Point Types in the Vitis HLS User Guide (UG1399).
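
Binary point alignment can be pictured as rescaling both operands to a common number of fractional bits before the integer operation. The sketch below illustrates the idea in standard C++ and is not how ap_fixed.h is implemented; RawFixed, add_aligned, and to_double are hypothetical names, and the sketch ignores result width and overflow.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical raw representation: value = raw * 2^-frac_bits.
struct RawFixed {
    int64_t raw;
    int frac_bits;
};

// Add two values after rescaling both to the larger fractional width.
RawFixed add_aligned(RawFixed a, RawFixed b) {
    const int frac = std::max(a.frac_bits, b.frac_bits);
    // Multiplying by a power of two moves the binary point without
    // changing the value; the operand that already has `frac` bits is
    // multiplied by 1.
    const int64_t a_raw = a.raw * (int64_t{1} << (frac - a.frac_bits));
    const int64_t b_raw = b.raw * (int64_t{1} << (frac - b.frac_bits));
    return {a_raw + b_raw, frac};
}

// Convert back to a real number for inspection.
double to_double(RawFixed x) {
    return std::ldexp(static_cast<double>(x.raw), -x.frac_bits);
}
```

For example, adding 1.5 held with 12 fractional bits (raw 6144) to -1.25 held with 4 fractional bits (raw -20) rescales the second operand to raw -5120, and the sum is raw 1024 with 12 fractional bits, which is 0.25.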