The HLS compiler attempts to identify Add-Multiply, Multiply-Add, and Add-Multiply-Add (AMA) operations, as well as their Subtract-Multiply and Multiply-Subtract variants, so that they can be assigned to DSPs on the physical device. The HLS compiler uses the following process to match operations to DSPs:
- Inline functions as needed
- Extract common sub-expressions
- Match fanout-free add->mul, mul->add, and add->mul->add (where add can also be sub) code fragments meeting bitwidth limitations of DSPs
- Schedule
- Select implementation in DSP or LUT guided by the BIND_OP pragma
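The fragment shapes named in the matching step can be sketched in plain C++ as follows. This is only an illustration: the function names and the `data_t`/`acc_t` types are assumptions (an HLS design would more likely use `ap_int` or `ap_fixed` types sized to the DSP ports), and whether each expression is actually assigned to a DSP still depends on bitwidths, fanout, and pragmas.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical operand and result types, chosen for illustration only.
using data_t = std::int16_t;
using acc_t  = std::int64_t;

// add->mul: a sum feeds the multiplier.
acc_t add_mul(data_t a, data_t d, data_t b) {
    return (static_cast<acc_t>(a) + d) * b;
}

// sub->mul: the same shape with a subtraction in place of the add.
acc_t sub_mul(data_t a, data_t d, data_t b) {
    return (static_cast<acc_t>(a) - d) * b;
}

// mul->add: a product feeds the adder.
acc_t mul_add(data_t a, data_t b, acc_t c) {
    return static_cast<acc_t>(a) * b + c;
}

// add->mul->add (AMA): a sum feeds the multiplier, whose product feeds an adder.
acc_t add_mul_add(data_t a, data_t d, data_t b, acc_t c) {
    return (static_cast<acc_t>(a) + d) * b + c;
}

// Small self-check with concrete values.
void check_patterns() {
    assert(add_mul(3, 4, 5) == 35);
    assert(sub_mul(3, 4, 5) == -5);
    assert(mul_add(3, 5, 10) == 25);
    assert(add_mul_add(3, 4, 5, 10) == 45);
}
```

Each function keeps its intermediate result private to a single expression, so none of the fragments has fanout.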
The ability of the HLS compiler to match operations or expressions to DSP implementation depends on the following:
- Datatypes and bitwidths
- Coding style and expressions
- The use of BIND_OP and other pragmas
Datatypes and Bitwidths
DSP matching is limited by design restrictions of DSP48 and DSP58 on the target device. Refer to UltraScale Architecture DSP Slice (UG579) for specific limitations on DSP48, and Versal Adaptive SoC DSP Engine Architecture Manual (AM004) for limits on DSP58.
As an example, in Versal devices the following are the port widths for DSP58:
- B 24 bits
- A 34 bits
- D 27 bits
- C 58 bits
- P 58 bits
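As a rough aid, the bit growth of an add->mul->add expression can be estimated and compared against the port widths listed above. The sketch below is an assumption-laden illustration, not the compiler's actual matching rule: the helper names are hypothetical, the mapping of `(a + d) * b + c` onto ports A, D, B, and C is assumed, and the internal datapath limits documented in AM004 may be tighter than the port widths alone.

```cpp
#include <algorithm>
#include <cassert>

// DSP58 port widths as listed above for Versal devices.
constexpr int kPortA = 34;
constexpr int kPortB = 24;
constexpr int kPortD = 27;
constexpr int kPortC = 58;
constexpr int kPortP = 58;

// Bits needed for the sum or difference of two signed operands.
constexpr int add_width(int wa, int wb) { return std::max(wa, wb) + 1; }

// Bits needed for the product of two signed operands.
constexpr int mul_width(int wa, int wb) { return wa + wb; }

// Hypothetical check for (a + d) * b + c: every operand must fit its
// assumed port, and the final result must fit the P port.
constexpr bool ama_fits_ports(int wa, int wd, int wb, int wc) {
    const int pre  = add_width(wa, wd);    // a + d
    const int prod = mul_width(pre, wb);   // (a + d) * b
    const int sum  = add_width(prod, wc);  // (a + d) * b + c
    return wa <= kPortA && wd <= kPortD && wb <= kPortB &&
           wc <= kPortC && sum <= kPortP;
}

// Small self-check with concrete widths.
void check_widths() {
    assert(ama_fits_ports(17, 17, 17, 40));   // result needs 41 bits
    assert(!ama_fits_ports(17, 17, 25, 40));  // b is wider than port B
    assert(!ama_fits_ports(30, 27, 24, 58));  // result needs 59 bits
}
```

A passing check here only says the widths are not obviously too large; the synthesis report remains the way to confirm how an expression was actually implemented.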
Coding Style
The style of writing the code matters:
- Bitwidths must be small enough to fit the DSP, including intermediate results
- No fanout is allowed, even as a result of common sub-expression elimination. For example, the following code will not be matched for DSP assignment due to the extraction of the sub-expression a*b:
f1 = a*b+c; f2 = a*b+d;
Tip: To avoid this problem, use non-inlined functions containing exactly the expressions to be matched.
- Non-inlined functions with specified latency, for example MULADD with latency 2, cannot be matched to an accumulator with II=1. The reason is that both the multiplier input and the adder input must be ready 2 cycles before the accumulation output is produced. In that case, the MULADD or MAC function (if any) must be inlined.
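The tip on avoiding fanout can be sketched as follows. The names `muladd` and `two_outputs` and the integer types are assumptions made for illustration. `#pragma HLS INLINE off` keeps the helper as a separate function so that each call contains exactly one mul->add expression; a standard C++ compiler ignores the pragma, which lets the sketch be checked in software.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types; an HLS design would typically use ap_int or ap_fixed.
using data_t = std::int16_t;
using acc_t  = std::int64_t;

// Non-inlined helper containing exactly the expression to be matched.
acc_t muladd(data_t a, data_t b, acc_t c) {
#pragma HLS INLINE off
    return static_cast<acc_t>(a) * b + c;
}

// Each output calls the helper instead of writing a*b twice in one scope,
// so a*b is not extracted as a common sub-expression with fanout.
void two_outputs(data_t a, data_t b, acc_t c, acc_t d, acc_t &f1, acc_t &f2) {
    f1 = muladd(a, b, c);
    f2 = muladd(a, b, d);
}

// Small self-check with concrete values.
void check_two_outputs() {
    acc_t f1 = 0;
    acc_t f2 = 0;
    two_outputs(3, 4, 5, 6, f1, f2);
    assert(f1 == 17);
    assert(f2 == 18);
}
```

The trade-off in this sketch is that the product is computed once per call rather than shared, in exchange for two expressions that each have the fanout-free shape.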
BIND_OP and Other Pragmas
For DSP matching, some pragmas or directives can prevent the matching of code for implementation in DSPs, while other pragmas or directives can help. In the following code example, the BIND_OP pragma actually prevents the assignment of the MULADD expression to DSP. By assigning the mul operation, it prevents the HLS compiler from recognizing and assigning the MULADD expression. Removing the pragma in this case will let the compiler make the assignment.
static const int NFRAC = 14;
typedef ap_fixed<3+NFRAC, 3, AP_TRN, AP_WRAP> DATA_T;
…
ACC_T MAC( DATA_T din0, COEF_T coef, ACC_T acc ) {
acc = din0 * coef + acc;
#pragma HLS BIND_OP variable=acc op=mul impl=dsp latency=2
return acc;
}
void process ( DATA_T din, DATA_T* dout0) {
#pragma HLS pipeline II=1
…
LOOP_MAC: for (int i=0; i<l_COEFF; i++) {
acc = MAC (sr[tdl_index], coeff[i], acc);
tdl_index = tdl_index-1;
}
The PIPELINE pragma causes the loop to be unrolled, and the expression balancing applied to the unrolled operations also prevents DSP matching. In this case, adding EXPRESSION_BALANCE off to disable expression balancing helps the HLS compiler recognize the multi-operation expressions for DSP matching. The reworked code example below, which also removes the BIND_OP pragma, has improved DSP matching.
static const int NFRAC = 14;
typedef ap_fixed<3+NFRAC, 3, AP_TRN, AP_WRAP> DATA_T;
…
ACC_T MAC( DATA_T din0, COEF_T coef, ACC_T acc ) {
acc = din0 * coef + acc;
return acc;
}
void process ( DATA_T din, DATA_T* dout0) {
#pragma HLS pipeline II=1
#pragma HLS expression_balance off
…
LOOP_MAC: for (int i=0; i<l_COEFF; i++) {
acc = MAC (sr[tdl_index], coeff[i], acc);
tdl_index = tdl_index-1;
}