DSP Complex Intrinsics - 2025.1 English - UG1399

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2025-05-29
Version: 2025.1 English

On Versal devices only, a set of DSP intrinsics is available in the namespace hls::dspcplx. They perform complex-type operations such as multiply, multiply-add, and accumulate by using the DSPCPLX primitive, that is, 2 DSP58s that together perform an 18x18 complex-type multiplication. The DSP complex intrinsics use the following data types:

  • std::complex<ap_int<18>> for the multiplier inputs, defined as the typedefs A_t and B_t
  • std::complex<ap_int<58>>, defined as the typedef C_t, for:
    • the adder input (for mul_add)
    • the multiplier output and the accumulator output (for mul_acc)

With these intrinsics, a complex multiplication uses only 2 DSP58s instead of 3. This is a significant resource saving compared with using std::complex<ap_int<18>> in functional C++ code.
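
The widths of these types can be checked with a little arithmetic. The sketch below is plain C++ with no HLS headers: int64_t stands in for ap_int<18> and ap_int<58>, and the function names are hypothetical, not part of hls::dspcplx. It computes the largest value that one real or imaginary component of an 18x18 complex product can reach, and how many spare bits remain in a 58-bit signed result.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ stand-in: int64_t replaces ap_int<18> and ap_int<58>.
// Function names are illustrative and not part of hls::dspcplx.

// Smallest two's complement width (in bits) that can hold v.
int signed_bits_needed(int64_t v) {
    int bits = 1;                    // sign bit
    int64_t mag = v < 0 ? ~v : v;    // ~v == -v - 1 for negative v
    while (mag != 0) {
        mag >>= 1;
        ++bits;
    }
    return bits;
}

// Largest real or imaginary component of A*B for 18-bit signed inputs.
int64_t max_product_component() {
    const int64_t lo = -(int64_t{1} << 17);     // most negative 18-bit value
    const int64_t hi = (int64_t{1} << 17) - 1;  // most positive 18-bit value
    const int64_t re_max = lo * lo - lo * hi;   // re = ar*br - ai*bi
    const int64_t im_max = lo * lo + lo * lo;   // im = ar*bi + ai*br
    return re_max > im_max ? re_max : im_max;
}

// Spare bits left in a 58-bit signed result after one worst-case product.
int headroom_bits() {
    return 58 - signed_bits_needed(max_product_component());
}

// Small self-check of the figures quoted in the text that follows.
void check_headroom() {
    assert(signed_bits_needed(max_product_component()) == 37);
    assert(headroom_bits() == 21);
}
```

Under these assumptions, one product component needs 37 bits, which leaves 21 bits of headroom. At least 2^21 worst-case products can therefore be summed before a 58-bit accumulator could overflow.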

The available functions are mul and mul_add. The acc and cascade classes provide accumulation and cascading, respectively, in the same way as the integer-datatype intrinsics. This includes the struct R_t, which captures the cascade outputs ACOUT, BCOUT, and PCOUT.
Table 1. DSP Complex Intrinsics Functions

| Function | Operation | Signature |
|---|---|---|
| mul | P = A*B | template <int64_t flags> C_t mul(A_t a, B_t b) |
| mul_add | P = A*B+C | template <int64_t flags> C_t mul_add(A_t a, B_t b, C_t c) |
| mul_acc | C = init ? A*B : A*B+C | template <int64_t flags> class acc { C_t mul_acc(A_t a, B_t b, bool init); } |
| cascade, mul_add | P = A*B+C | template <int64_t flags> class cascade { R_t mul_add(A_t a, B_t b, C_t c); } |
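
The operations in the table can be written out as a software-only reference model, which may be useful as a golden model in a C testbench. The sketch below is plain C++ and makes several assumptions: the struct Cplx with int64_t fields stands in for A_t, B_t, and C_t, so the wrap-around of the real 18-bit and 58-bit types is not modelled, and the names ref_mul, ref_mul_add, and RefAcc are hypothetical, not part of hls::dspcplx. Register flags and latency are not modelled either.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for A_t, B_t and C_t (assumption: no bit-width wrap-around).
struct Cplx {
    int64_t re;
    int64_t im;
};

// P = A * B
Cplx ref_mul(Cplx a, Cplx b) {
    return {a.re * b.re - a.im * b.im,
            a.re * b.im + a.im * b.re};
}

// P = A * B + C
Cplx ref_mul_add(Cplx a, Cplx b, Cplx c) {
    Cplx p = ref_mul(a, b);
    return {p.re + c.re, p.im + c.im};
}

// C = init ? A * B : A * B + C, where C is held between calls.
class RefAcc {
public:
    Cplx mul_acc(Cplx a, Cplx b, bool init) {
        Cplx zero = {0, 0};
        acc_ = ref_mul_add(a, b, init ? zero : acc_);
        return acc_;
    }

private:
    Cplx acc_ = {0, 0};
};

// Usage example: (1+2i)*(3+4i) = -5+10i, then accumulate i*i = -1.
void reference_model_example() {
    Cplx p = ref_mul({1, 2}, {3, 4});
    assert(p.re == -5 && p.im == 10);

    RefAcc acc;
    acc.mul_acc({1, 2}, {3, 4}, /* init */ true);
    Cplx r = acc.mul_acc({0, 1}, {0, 1}, /* init */ false);
    assert(r.re == -6 && r.im == 10);
}
```

A plain struct is used here rather than std::complex<int64_t> because the C++ standard leaves std::complex unspecified for element types other than float, double, and long double.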

The code examples below show their use. Note the following:

  • For the values of the register flags, only the M and P registers can be chosen freely. The A, B, C, and AD registers support only 3 configurations:
    • NO_PIPELINE (none of REG_A1, REG_A2, REG_B1, REG_B2, REG_AD, REG_D, or REG_C)
    • BALANCED_PIPELINE (REG_A1 and REG_B1)
    • FULL_PIPELINE (REG_A1, REG_A2, REG_B1, REG_B2, and REG_AD)
    These 3 symbols are defined in the namespace, like the other REG_xx symbols.
  • The examples show both the use of pairs of real inputs and outputs (for example, ap_int<18>) and the use of complex inputs and outputs (for example, hls::dspcplx::A_t, which is a typedef of std::complex<ap_int<18>>).
  • The test_mul example has two versions:
    • test_mul1 uses only the intrinsic methods. As a result, its RTL implementation:
      • Is guaranteed to use 4 DSPs (that is, 2 DSPCPLX instances)
      • May or may not use the cascaded connection between them, because the Vitis HLS scheduler can insert registers between the DSPCPLX instances that may prevent inference of the cascade connections
    • test_mul2 uses the cascade class to ensure both DSPCPLX inference and use of the cascade connections. In this case, scheduling constraints are added to ensure that the proper number of registers is used between the DSPCPLX instances.
  • Inference of DSPCPLX is guaranteed only for std::complex<ap_int<8>> and higher bitwidths.
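
The rule for the register flags can be illustrated with a small compile-time check. The sketch below is plain C++ and is only an illustration: the bit values assigned to the REG_xx symbols, the namespace sketch, and the helper is_supported are all hypothetical. The real symbols are defined by the Vitis HLS headers and may use different values.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
// Hypothetical bit assignments, chosen only for this illustration.
constexpr int64_t REG_A1 = 1 << 0;
constexpr int64_t REG_A2 = 1 << 1;
constexpr int64_t REG_B1 = 1 << 2;
constexpr int64_t REG_B2 = 1 << 3;
constexpr int64_t REG_AD = 1 << 4;
constexpr int64_t REG_D  = 1 << 5;
constexpr int64_t REG_C  = 1 << 6;
constexpr int64_t REG_M  = 1 << 7;
constexpr int64_t REG_P  = 1 << 8;

// The three named input-register configurations from the list above.
constexpr int64_t NO_PIPELINE       = 0;
constexpr int64_t BALANCED_PIPELINE = REG_A1 | REG_B1;
constexpr int64_t FULL_PIPELINE     = REG_A1 | REG_A2 | REG_B1 | REG_B2 | REG_AD;

// Registers that may only appear in one of the three named configurations.
constexpr int64_t INPUT_REGS =
    REG_A1 | REG_A2 | REG_B1 | REG_B2 | REG_AD | REG_D | REG_C;

// True if the input-register part of flags matches a named configuration.
// REG_M and REG_P are ignored because they can be chosen freely.
constexpr bool is_supported(int64_t flags) {
    return (flags & INPUT_REGS) == NO_PIPELINE ||
           (flags & INPUT_REGS) == BALANCED_PIPELINE ||
           (flags & INPUT_REGS) == FULL_PIPELINE;
}
}  // namespace sketch

// Usage example: combinations that appear in the listing, plus one that
// mixes input registers outside the three named configurations.
void flag_examples() {
    using namespace sketch;
    assert(is_supported(NO_PIPELINE | REG_M));
    assert(is_supported(REG_A1 | REG_B1 | REG_M | REG_P));
    assert(!is_supported(REG_A1 | REG_M));
}
```

Under these assumptions, the combination REG_A1|REG_B1|REG_M|REG_P used in test_mul2 is the same as BALANCED_PIPELINE|REG_M|REG_P written out in full.
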
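#include <complex>
#include "ap_int.h"
// Also include the Vitis HLS header that declares the hls::dspcplx intrinsics.

using namespace hls::dspcplx;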

// example of an intrinsic method, inferring one DSPCPLX
void test_mul_add(
  ap_int<18> a_r, ap_int<18> a_i,
  ap_int<18> b_r, ap_int<18> b_i,
  ap_int<58> c_r, ap_int<58> c_i,
  ap_int<58>& r_r, ap_int<58>& r_i) {
    std::complex<ap_int<18>> a = {a_r, a_i}; 
    std::complex<ap_int<18>> b = {b_r, b_i}; 
    std::complex<ap_int<58>> c = {c_r, c_i}; 
    std::complex<ap_int<58>> r = mul_add<NO_PIPELINE | REG_M>(a, b, c);
    r_r = r.real();
    r_i = r.imag();
}

// example of two intrinsic methods, inferring two DSPCPLX
void test_mul1(A_t a0, A_t a1, B_t b0, B_t b1, C_t &r) {
    // computes r = a0 * b0 + a1 * b1
    C_t t = mul<FULL_PIPELINE>(a0, b0);
    r = mul_add<FULL_PIPELINE | REG_M>(a1, b1, t);
}

// same example as above, but using the cascade class to ensure cascaded connection
void test_mul2(A_t a0, A_t a1, B_t b0, B_t b1, C_t &z) {
    #pragma HLS pipeline II=1
    // computes z = a0 * b0 + a1 * b1
    cascade<REG_A1|REG_B1|REG_M|REG_P > dsp0;
    cascade<REG_A1|REG_B1|REG_M|REG_P > dsp1;
    C_t zero = {0, 0};
    // mul is not available for cascade, so use mul_add with a zero addend
    auto tmp = dsp0.mul_add(a0, b0, zero);
    // second DSPCPLX, fed by the cascade output (PCOUT) of the first
    auto out = dsp1.mul_add(a1, b1, tmp.pcout);
    z = out.pcout;
}

// example showing the acc class, inferring one DSPCPLX (with accumulator)

const int N=10;

void test_mul_acc(A_t sr[N], B_t coeff[N], C_t &res) {
    C_t t = {0, 0};
    // computes t += sr[i] * coeff[i] for all i in {0 .. N-1}
    acc<BALANCED_PIPELINE | REG_M | REG_P> my_acc;
acc_loop:
    for (int i = 0; i < N; i++) {
        #pragma HLS pipeline II=1
        t = my_acc.mul_acc( sr[i], coeff[i], /* init condition */ i == 0);
    }
    res = t; 
}

// example showing cascades to infer "size" DSPCPLX and using cascaded connections
// (N is reused from the previous example; redefining it in the same file would not compile)
const int TAPS=4;

template <int size>
class FIR{
     private:
         const A_t* coeff_; 
         C_t bias_;
         cascade<BALANCED_PIPELINE|REG_M|REG_P > dsp0;
         cascade<    FULL_PIPELINE|REG_M|REG_P > dsp_full[size - 1];
     public:
        FIR(const A_t* coeff, C_t bias) : coeff_(coeff), bias_(bias) {
            #pragma HLS ARRAY_PARTITION variable=dsp_full complete dim=1 
        };

        C_t fir(B_t input){
            auto out = dsp0.mul_add(coeff_[0], input, bias_);
            for(int j=1; j < size ; j++){
                #pragma HLS unroll
                out = dsp_full[j - 1].mul_add(coeff_[j], out.bcout, out.pcout);
            }
            return out.pcout;
        }; // end of fir function
}; // end of FIR class

void systolic_fir( A_t coeff[TAPS], B_t b[N], C_t bias, C_t hw[N]) {
    #pragma HLS ARRAY_PARTITION variable=coeff complete dim=1
   
    FIR<TAPS> my_fir(coeff, bias);

    LOOP_FIR:
    for(auto i = 0 ; i< N ; i++){
        #pragma HLS pipeline II=1
        hw[i] = my_fir.fir(b[i]);
    }
}