Vectorized Version Using a Single Kernel - 2022.1 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID
UG1079
Release Date
2022-05-25
Version
2022.1 English

The AI Engine natively supports multiple lanes of MAC operations. For variations of FIR applications, use the group of aie::sliding_mul* classes and functions introduced in Multiple Lanes Multiplications - sliding_mul.

This section uses the aie::sliding_mul and aie::sliding_mac functions with Lanes=8 and Points=8. Both the data and coefficient step sizes are 1, which is the default. For example, acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8); performs:
Lane 0: acc[0]=acc[0]+coe[1][0]*buff[8]+coe[1][1]*buff[9]+...+coe[1][7]*buff[15];
Lane 1: acc[1]=acc[1]+coe[1][0]*buff[9]+coe[1][1]*buff[10]+...+coe[1][7]*buff[16];
...
Lane 7: acc[7]=acc[7]+coe[1][0]*buff[15]+coe[1][1]*buff[16]+...+coe[1][7]*buff[22];

Notice that the data read from buff starts at a different index in each lane. With eight lanes of eight points each, the operation reads 15 samples (buff[8] through buff[22]), and all of them must be in the buffer before execution.
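
The per-lane indexing described above can be checked with a small host-side model. This is a minimal sketch, not the AI Engine API: sliding_mac_model and last_index_read are hypothetical names, the samples are plain int rather than cint16, both step sizes are assumed to be 1, and indices are assumed to wrap modulo the buffer size.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side reference model, NOT the AI Engine API.
// Assumes step sizes of 1 and real-valued int samples instead of cint16.
template <int Lanes, int Points, std::size_t NCoeff, std::size_t NData>
std::array<long long, Lanes> sliding_mac_model(std::array<long long, Lanes> acc,
                                               const std::array<int, NCoeff>& coeff,
                                               int coeff_start,
                                               const std::array<int, NData>& data,
                                               int data_start) {
    for (int lane = 0; lane < Lanes; ++lane) {
        for (int p = 0; p < Points; ++p) {
            // Each lane starts one sample later than the previous lane.
            // Wrapping modulo the buffer size is an assumption of this model.
            const std::size_t d =
                static_cast<std::size_t>(data_start + lane + p) % NData;
            acc[lane] += static_cast<long long>(coeff[coeff_start + p]) * data[d];
        }
    }
    return acc;
}

// Highest data index read by one call, before any wraparound.
constexpr int last_index_read(int lanes, int points, int data_start) {
    return data_start + (lanes - 1) + (points - 1);
}
```

For Lanes=8, Points=8, and data_start=8, last_index_read returns 22, which matches the range buff[8] to buff[22] stated above.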

Because the FIR has 32 taps, it requires one aie::sliding_mul<8,8> operation followed by three aie::sliding_mac<8,8> operations to calculate eight lanes of output. The data buffer is updated from the stream port using buff.insert.
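
To see why four 8-point operations cover the 32 taps, the following host-side sketch compares a direct 32-tap sum with the same sum split into four groups of eight coefficients. It is illustrative only: fir_direct and fir_grouped are hypothetical names, samples are plain int rather than cint16, and the data buffer is a linear array of 64 entries so that no wraparound is involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr int kTaps = 32;    // filter length
constexpr int kLanes = 8;    // outputs computed together
constexpr int kPoints = 8;   // coefficients consumed per operation

using Lanes8 = std::array<long long, kLanes>;

// Direct form: each lane sums all 32 products in one pass.
Lanes8 fir_direct(const std::array<int, kTaps>& coef,
                  const std::array<int, 64>& data, int start) {
    Lanes8 y{};
    for (int lane = 0; lane < kLanes; ++lane)
        for (int k = 0; k < kTaps; ++k)
            y[lane] += static_cast<long long>(coef[k]) * data[start + lane + k];
    return y;
}

// Grouped form: group 0 plays the role of sliding_mul (accumulator starts at
// zero), groups 1..3 play the role of sliding_mac (accumulate on top).
Lanes8 fir_grouped(const std::array<int, kTaps>& coef,
                   const std::array<int, 64>& data, int start) {
    Lanes8 acc{};
    for (int g = 0; g < kTaps / kPoints; ++g) {
        const int data_start = start + g * kPoints;  // 0, 8, 16, 24 past start
        for (int lane = 0; lane < kLanes; ++lane)
            for (int p = 0; p < kPoints; ++p)
                acc[lane] += static_cast<long long>(coef[g * kPoints + p]) *
                             data[data_start + lane + p];
    }
    return acc;
}
```

Both functions return the same eight outputs for any input, because the grouped form only reorders the same 32 products per lane.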

The vectorized kernel code is as follows:

//keep margin data between different executions of graph
static aie::vector<cint16,32> delay_line;

alignas(aie::vector_decl_align) static cint16 eq_coef[32]={{1,2},{3,4},...};

__attribute__((noinline)) void fir_32tap_vector(input_stream<cint16> * sig_in, output_stream<cint16> * sig_out){
  const int LSIZE=(SAMPLES/32);
  aie::accum<cacc48,8> acc;
  const aie::vector<cint16,8> coe[4] = {aie::load_v<8>(eq_coef),aie::load_v<8>(eq_coef+8),aie::load_v<8>(eq_coef+16),aie::load_v<8>(eq_coef+24)};
  aie::vector<cint16,32> buff=delay_line;
  for(int i=0;i<LSIZE;i++){
    //compute the 1st 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);
    buff.insert(0,readincr_v<4>(sig_in));
    buff.insert(1,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute the 2nd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,16);
    buff.insert(2,readincr_v<4>(sig_in));
    buff.insert(3,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,0);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute the 3rd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,24);
    buff.insert(4,readincr_v<4>(sig_in));
    buff.insert(5,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,8);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute the 4th 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,0);
    buff.insert(6,readincr_v<4>(sig_in));
    buff.insert(7,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,16);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));
  }
  delay_line=buff;
}
void fir_32tap_vector_init()
{
  //initialize data
  for (int i=0;i<8;i++){
    aie::vector<int16,8> tmp=get_wss(0);
    delay_line.insert(i,tmp.cast_to<cint16>());
  }
}
Note:
  • alignas(aie::vector_decl_align) can be used to ensure data is aligned for vector load and store.
  • Each iteration of the main loop computes 32 samples (four groups of eight lanes). Consequently, the loop count is reduced to SAMPLES/32.
  • Data updates, calculations, and data writes are interleaved in the code. The data_start argument of aie::sliding_mul and aie::sliding_mac selects which portion of the data buffer buff is read.
  • For more information about supported data types and lane numbers for aie::sliding_mul, see AI Engine API User Guide (UG1529).
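
The interleaving of buffer updates and calculations can be sanity-checked on a host with a scalar model of the main loop. The sketch below is an assumption-laden illustration, not the kernel itself: mac8x8, fir_stream_model, and fir_reference are hypothetical names, samples are plain int rather than cint16, the two 4-sample inserts per group are merged into one 8-sample write, the output shift is omitted, and data indices are assumed to wrap modulo the 32-entry buffer (which the kernel's use of data_start=24 relies on).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kBuf = 32;  // circular data buffer size, equal to the tap count
using Acc = std::array<long long, 8>;

// One 8-lane, 8-point multiply-accumulate with modulo-32 indexing (assumed).
void mac8x8(Acc& acc, const int* coef8, const std::array<int, kBuf>& buff,
            int data_start) {
    for (int lane = 0; lane < 8; ++lane)
        for (int p = 0; p < 8; ++p)
            acc[lane] += static_cast<long long>(coef8[p]) *
                         buff[(data_start + lane + p) % kBuf];
}

// Host model of the main loop. `in.size()` is assumed to be a multiple of 32.
std::vector<long long> fir_stream_model(const std::array<int, 32>& coef,
                                        std::array<int, kBuf>& delay_line,
                                        const std::vector<int>& in) {
    std::vector<long long> out;
    std::array<int, kBuf> buff = delay_line;
    std::size_t rd = 0;
    for (std::size_t it = 0; it < in.size() / 32; ++it) {
        for (int blk = 0; blk < 4; ++blk) {
            const int base = 8 * blk;
            Acc acc{};
            // First two groups read only data that is already in the buffer.
            mac8x8(acc, &coef[0], buff, base);
            mac8x8(acc, &coef[8], buff, (base + 8) % kBuf);
            // Overwrite the 8 oldest slots, which the first group just read.
            for (int j = 0; j < 8; ++j) buff[base + j] = in[rd++];
            // The last group wraps into the freshly written slots.
            mac8x8(acc, &coef[16], buff, (base + 16) % kBuf);
            mac8x8(acc, &coef[24], buff, (base + 24) % kBuf);
            out.insert(out.end(), acc.begin(), acc.end());
        }
    }
    delay_line = buff;  // keep margin data for the next invocation
    return out;
}

// Straightforward reference: out[m] = sum_k coef[k] * z[m + k],
// where z is the 32-sample history followed by the new input.
std::vector<long long> fir_reference(const std::array<int, 32>& coef,
                                     const std::array<int, kBuf>& history,
                                     const std::vector<int>& in) {
    std::vector<int> z(history.begin(), history.end());
    z.insert(z.end(), in.begin(), in.end());
    std::vector<long long> out(in.size(), 0);
    for (std::size_t m = 0; m < in.size(); ++m)
        for (int k = 0; k < 32; ++k)
            out[m] += static_cast<long long>(coef[k]) * z[m + k];
    return out;
}
```

In this model the write into slots base..base+7 must come after the first two groups and before the last one, which mirrors the position of the buff.insert calls in the kernel listing.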

Next, identify the initiation interval of the main loop. To locate the initiation interval of the loop:

  1. Add the -v option to aiecompiler to output a verbose report of kernel compilation.
  2. Open the kernel compilation log, for example, Work/aie/<COL_ROW>/<COL_ROW>.log.
  3. In the log, search for keywords, such as do-loop, to find the initiation interval of the loop.
    An example result follows:
    HW do-loop #2821 in ".../fir_32tap_vector.cc", line 21: (loop #3) :
    critical cycle of length 130 : ...
    minimum length due to resources: 128
    scheduling HW do-loop #2821
    (algo 2) -> # cycles: ......
    NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
    -> # cycles: .....183 (exceeds -k 110) -> no folding: 183
    -> HW do-loop #2821 in ".../Vitis/2022.1/aietools/include/adf/stream/me/accessors.h", line 870: (loop #3) : 183 cycles
    where:
    • The initiation interval of the loop is 183. Because each loop iteration produces 32 samples, a sample is produced in roughly 183/32≈5.7 cycles.
    • The message (exceeds -k 110) -> no folding indicates that the scheduler is not attempting software pipelining because the loop cycle count exceeds a limit.
  4. To override the loop cycle limit, add a user constraint, such as --Xchess="fir_32tap_vector:backend.mist2.maxfoldk=200", to the aiecompiler command.

    The example result is then as follows:

    scheduling HW do-loop #2821
    (algo 2) -> # cycles: ......
    NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
    -> # cycles: .....183 
    (modulo) -> # cycles: ... ok (required budget ratio: 2)
    ...
    (resume algo) -> after folding: 161 (folded over 1 iterations)
    -> HW do-loop #2821 in ".../Vitis/2022.1/aietools/include/adf/stream/me/accessors.h", line 870: (loop #3) : 161 cycles

    Here, the loop takes 161 cycles after folding, so the kernel requires roughly 161/32≈5 cycles to produce a sample.
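
The cycles-per-sample figures above can be derived from the log with a small host-side helper. This is a sketch under stated assumptions: loop_cycles and cycles_per_sample are hypothetical names, and the regular expression only assumes that the do-loop summary line ends in ": <N> cycles", as in the two example results shown above. Other compiler versions may format the log differently.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Extract the trailing "<N> cycles" from a do-loop summary line.
// Returns std::nullopt when the line does not end in that pattern.
std::optional<int> loop_cycles(const std::string& line) {
    static const std::regex re(R"(:\s*(\d+)\s+cycles\s*$)");
    std::smatch m;
    if (std::regex_search(line, m, re)) return std::stoi(m[1].str());
    return std::nullopt;
}

// Average cycles needed per output sample for one loop iteration.
double cycles_per_sample(int cycles, int samples_per_iteration) {
    return static_cast<double>(cycles) / samples_per_iteration;
}
```

With 32 samples per iteration, 183 cycles gives about 5.72 cycles per sample and 161 cycles gives about 5.03.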