DSP Multi-Operation Matching

Vitis High-Level Synthesis User Guide (UG1399)

Document ID

UG1399

Release Date

2023-12-18

Version

2023.2 English

The HLS compiler attempts to identify Add-Multiply, Multiply-Add, and Add-Multiply-Add (AMA) operations, as well as their Subtract-Multiply and Multiply-Subtract variants, so that they can be mapped to DSP blocks on the physical device. The HLS compiler uses the following process to match operations to DSPs:

Inline functions as needed
Extract common sub-expressions
Match fanout-free add->mul, mul->add, and add->mul->add (where add can also be sub) code fragments meeting bitwidth limitations of DSPs
Schedule
Select implementation in DSP or LUT guided by the BIND_OP pragma
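
The expression shapes named in the matching step can be sketched in plain C++ as follows. This is only an illustration of the fanout-free patterns: the function names and the 16-bit and 32-bit operand types are assumptions chosen so the code compiles without the Vitis headers, and whether a given expression is actually mapped to a DSP still depends on the bitwidths, coding style, and pragmas discussed below.

```cpp
#include <cstdint>

// All names and operand widths here are illustrative assumptions.
// A real design would typically use ap_int/ap_fixed types sized for the DSP.

// mul->add: the product feeds only the adder (no other consumer).
int64_t mul_add(int16_t a, int16_t b, int32_t c) {
    return static_cast<int64_t>(a) * b + c;
}

// add->mul: the sum feeds only the multiplier.
int64_t add_mul(int16_t a, int16_t d, int16_t b) {
    return static_cast<int64_t>(a + d) * b;
}

// add->mul->add (AMA): pre-add, multiply, then post-add.
int64_t add_mul_add(int16_t a, int16_t d, int16_t b, int32_t c) {
    return static_cast<int64_t>(a + d) * b + c;
}

// The same AMA shape with the first add replaced by a sub.
int64_t sub_mul_add(int16_t a, int16_t d, int16_t b, int32_t c) {
    return static_cast<int64_t>(a - d) * b + c;
}
```

In each function every intermediate result has exactly one consumer, which is the fanout-free property the matching step requires.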

The ability of the HLS compiler to match operations or expressions to DSP implementation depends on the following:

Datatypes and bitwidths
Coding style and expressions
The use of BIND_OP and other pragmas

Datatypes and Bitwidths

DSP matching is limited by design restrictions of DSP48 and DSP58 on the target device. Refer to UltraScale Architecture DSP Slice (UG579) for specific limitations on DSP48, and Versal Adaptive SoC DSP Engine Architecture Manual (AM004) for limits on DSP58.

As an example, in Versal devices the following are the port widths for DSP58:

B 24 bits
A 34 bits
D 27 bits
C 58 bits
P 58 bits
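
As a rough aid for reasoning about these limits, the sketch below checks the operand and result widths of a multiply-add `p = a * b + c` against the port widths listed above. It is a coarse pre-check under stated assumptions, not the rule the HLS compiler applies: the helper names are hypothetical, the bit-growth rules are the usual full-precision ones for signed integers (a product needs the sum of the operand widths, a sum needs one bit more than its widest operand), and the internal multiplier and pre-adder widths described in AM004 are not modeled.

```cpp
// Port widths for DSP58 in bits, as listed above.
namespace dsp58 {
constexpr int A = 34;
constexpr int B = 24;
constexpr int C = 58;
constexpr int D = 27;
constexpr int P = 58;
}

// Full-precision width of a signed product.
constexpr int mul_width(int w1, int w2) { return w1 + w2; }

// Full-precision width of a signed sum: widest operand plus one bit.
constexpr int add_width(int w1, int w2) { return (w1 > w2 ? w1 : w2) + 1; }

// Hypothetical coarse check for p = a * b + c: each operand must fit its
// port, and the full-precision result must fit the P output.
constexpr bool muladd_fits_ports(int wa, int wb, int wc) {
    return wa <= dsp58::A && wb <= dsp58::B && wc <= dsp58::C &&
           add_width(mul_width(wa, wb), wc) <= dsp58::P;
}

// A 17-bit by 18-bit product accumulated into 40 bits passes the check.
static_assert(muladd_fits_ports(17, 18, 40), "expected to fit");
// A 32-bit B operand exceeds the 24-bit B port.
static_assert(!muladd_fits_ports(32, 32, 64), "expected not to fit");
```

A check of this kind only rules out expressions that clearly cannot fit; passing it does not guarantee that the compiler maps the expression to a DSP.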

Coding Style

The style of writing the code matters:

Bitwidths must be small enough to fit the DSP, including intermediate results
No fanout is allowed, even as a result of common sub-expression elimination. For example, the following code will not be matched for DSP assignment because the sub-expression a*b is extracted and then shared by both additions:
```
f1 = a*b+c;
f2 = a*b+d;
```
Tip: To avoid this problem use non-inlined functions containing exactly the expressions to be matched.
A non-inlined function with a specified latency, for example MULADD with a latency of 2, cannot be matched to an accumulator with II=1. The reason is that both the multiplier inputs and the adder input must be ready 2 cycles before the accumulation output is produced, whereas an accumulator with II=1 needs its previous output as the adder input on the next cycle. In that case, the MULADD or MAC function (if any) must be inlined.
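
The tip about non-inlined functions can be sketched as follows. The helper name `muladd`, the wrapper `two_muladds`, and the integer types are assumptions made for illustration. The intent is that each call contains exactly the mul->add expression, so the product a*b is not extracted and shared between f1 and f2. A standard C++ compiler ignores the HLS pragma.

```cpp
#include <cstdint>

// Hypothetical helper containing exactly the expression to be matched.
// INLINE off keeps the function as a separate unit in the HLS hierarchy.
int64_t muladd(int16_t a, int16_t b, int32_t c) {
#pragma HLS INLINE off
    return static_cast<int64_t>(a) * b + c;
}

// Hypothetical caller: both results use a*b, but each product lives
// inside its own call to muladd rather than in a shared sub-expression.
void two_muladds(int16_t a, int16_t b, int32_t c, int32_t d,
                 int64_t* f1, int64_t* f2) {
    *f1 = muladd(a, b, c);
    *f2 = muladd(a, b, d);
}
```

As noted above, this approach does not suit an accumulator with II=1 when the helper has a specified latency; in that case the helper should be inlined instead.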

BIND_OP and Other Pragmas

Some pragmas or directives can prevent code from being matched for implementation in DSPs, while others can help. In the following code example, the BIND_OP pragma actually prevents the assignment of the MULADD expression to a DSP. Because the pragma binds the mul operation on its own, the HLS compiler no longer recognizes the combined MULADD expression. Removing the pragma in this case lets the compiler make the assignment.

```
static const int NFRAC = 14;
typedef ap_fixed<3+NFRAC, 3, AP_TRN, AP_WRAP> DATA_T;
…
ACC_T MAC(DATA_T din0, COEF_T coef, ACC_T acc) {
    acc = din0 * coef + acc;
    // Binding only the mul operation blocks matching of the full MULADD expression
#pragma HLS BIND_OP variable=acc op=mul impl=dsp latency=2
    return acc;
}
void process(DATA_T din, DATA_T* dout0) {
#pragma HLS pipeline II=1
…
    LOOP_MAC: for (int i=0; i<l_COEFF; i++) {
        acc = MAC(sr[tdl_index], coeff[i], acc);
        tdl_index = tdl_index-1;
    }
}
```

The PIPELINE pragma causes the loop to be unrolled, and expression balancing is then applied to the unrolled operations, which also prevents DSP matching. In this case, adding the EXPRESSION_BALANCE pragma with the off option disables expression balancing and helps the HLS compiler recognize the multi-operation expressions for DSP matching. The reworked code example below removes the BIND_OP pragma and disables expression balancing, which improves DSP matching.

```
static const int NFRAC = 14;
typedef ap_fixed<3+NFRAC, 3, AP_TRN, AP_WRAP> DATA_T;
…
ACC_T MAC(DATA_T din0, COEF_T coef, ACC_T acc) {
    acc = din0 * coef + acc;
    return acc;
}
void process(DATA_T din, DATA_T* dout0) {
#pragma HLS pipeline II=1
    // Keep the mul->add chain intact so it can be matched to DSPs
#pragma HLS expression_balance off
…
    LOOP_MAC: for (int i=0; i<l_COEFF; i++) {
        acc = MAC(sr[tdl_index], coeff[i], acc);
        tdl_index = tdl_index-1;
    }
}
```
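
For experimenting with this pattern outside the tool, the following is a minimal self-contained sketch of the same MAC loop. It is not the original design: the elided parts of the example above are not reconstructed, the names `N_TAPS`, `mac`, and `fir_step` are hypothetical, and plain integer types stand in for the `ap_fixed` types so that the code compiles with a standard C++ compiler, which ignores the HLS pragmas.

```cpp
#include <cstdint>

constexpr int N_TAPS = 4;  // hypothetical tap count

// Plain integer stand-ins for the fixed-point types (assumption).
using data_t = int16_t;
using coef_t = int16_t;
using acc_t  = int64_t;

// Multiply-accumulate with no BIND_OP pragma, so the whole
// mul->add expression stays available for matching.
acc_t mac(data_t din, coef_t coef, acc_t acc) {
    return static_cast<acc_t>(din) * coef + acc;
}

// One filter output: walks the delay line backwards while the
// coefficient index runs forwards, as in the example above.
acc_t fir_step(const data_t sr[N_TAPS], const coef_t coeff[N_TAPS]) {
#pragma HLS pipeline II=1
#pragma HLS expression_balance off
    acc_t acc = 0;
    int tdl_index = N_TAPS - 1;
LOOP_MAC:
    for (int i = 0; i < N_TAPS; i++) {
        acc = mac(sr[tdl_index], coeff[i], acc);
        tdl_index = tdl_index - 1;
    }
    return acc;
}
```

Running this on a workstation only checks the arithmetic. Whether each multiply-add lands in a DSP has to be confirmed in the synthesis report.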