Vectorized Version Using a Single Kernel - 2024.2 English - UG1079

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2024-11-28
Version: 2024.2 English

The AI Engine natively supports multiple lanes of MAC operations. For FIR filters and their variants, use the aie::sliding_mul* family of classes and functions introduced in Multiple Lanes Multiplication - sliding_mul.

This section uses the aie::sliding_mul and aie::sliding_mac functions with Lanes=8 and Points=8. The data and coefficient step sizes are both 1, which is the default. For example, acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8); performs:
Lane 0: acc[0]=acc[0]+coe[1][0]*buff[8]+coe[1][1]*buff[9]+...+coe[1][7]*buff[15];
Lane 1: acc[1]=acc[1]+coe[1][0]*buff[9]+coe[1][1]*buff[10]+...+coe[1][7]*buff[16];
...
Lane 7: acc[7]=acc[7]+coe[1][0]*buff[15]+coe[1][1]*buff[16]+...+coe[1][7]*buff[22];

Notice that the data read from buff starts at a different index in each lane. A single call therefore reads Lanes+Points-1 = 15 samples (buff[8] to buff[22] in this example), and all of them must be ready before execution.
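
The lane-by-lane expansion above can be reproduced with a scalar model. The following is a minimal sketch in plain C++, not AI Engine code: sliding_mac_model and last_index are hypothetical helpers, real integers stand in for cint16, the coefficient start is 0, both step sizes are 1, and data indices are assumed to wrap around the end of the buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of a sliding multiply-accumulate; not an AI Engine API.
// Assumptions: coefficient start 0, step sizes of 1, and data indices that
// wrap around the end of the buffer.
template <std::size_t Lanes, std::size_t Points, std::size_t N>
std::array<long long, Lanes> sliding_mac_model(
    std::array<long long, Lanes> acc,
    const std::array<long long, Points>& coeff,
    const std::array<long long, N>& data,
    std::size_t data_start)
{
    assert(data_start < N);
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        for (std::size_t p = 0; p < Points; ++p) {
            // Each lane starts one sample later than the previous lane.
            acc[lane] += coeff[p] * data[(data_start + lane + p) % N];
        }
    }
    return acc;
}

// Index of the last sample read by one call, before any wrapping.
constexpr std::size_t last_index(std::size_t data_start,
                                 std::size_t lanes,
                                 std::size_t points)
{
    return data_start + (lanes - 1) + (points - 1);
}
```

With data_start=8, Lanes=8 and Points=8, last_index returns 22, which matches the buff[8] to buff[22] range read by the example call.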

Because the FIR has 32 taps, it requires one aie::sliding_mul<8,8> operation followed by three aie::sliding_mac<8,8> operations to calculate eight lanes of output. The data buffer is updated from the stream port using buff.insert.

The vectorized kernel code is as follows:

//keep margin data between different executions of graph
static aie::vector<cint16,32> delay_line;

alignas(aie::vector_decl_align) static cint16 eq_coef[32]={{1,2},{3,4},...};

__attribute__((noinline)) void fir_32tap_vector(input_stream<cint16> * sig_in, output_stream<cint16> * sig_out){
  const int LSIZE=(SAMPLES/32);
  aie::accum<cacc48,8> acc;
  const aie::vector<cint16,8> coe[4] = {aie::load_v<8>(eq_coef),aie::load_v<8>(eq_coef+8),aie::load_v<8>(eq_coef+16),aie::load_v<8>(eq_coef+24)};
  aie::vector<cint16,32> buff=delay_line;
  for(int i=0;i<LSIZE;i++){
    //1st 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);
    buff.insert(0,readincr_v<4>(sig_in));
    buff.insert(1,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //2nd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,16);
    buff.insert(2,readincr_v<4>(sig_in));
    buff.insert(3,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,0);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //3rd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,24);
    buff.insert(4,readincr_v<4>(sig_in));
    buff.insert(5,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,8);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //4th 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,0);
    buff.insert(6,readincr_v<4>(sig_in));
    buff.insert(7,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,16);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));
  }
  delay_line=buff;
}
void fir_32tap_vector_init()
{
  //initialize data
  for (int i=0;i<8;i++){
    aie::vector<int16,8> tmp=get_wss(0);
    delay_line.insert(i,tmp.cast_to<cint16>());
  }
}
Note:
  • alignas(aie::vector_decl_align) can be used to ensure data is aligned for vector load and store.
  • Each iteration of the main loop computes 32 output samples (four groups of eight lanes). Consequently, the loop count is reduced to SAMPLES/32.
  • Data update, calculation, and data write are interleaved in the code. The data_start argument of aie::sliding_mul and aie::sliding_mac selects which portion of the data buffer buff is read.
  • For more information about supported data types and lane numbers for aie::sliding_mul, see AI Engine API User Guide (UG1529).
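
The interleaving of buffer updates and sliding operations in the kernel can be checked against a direct FIR sum with a scalar model. The sketch below is plain C++, not AI Engine code, and every name in it (sliding_mac8, fir_blocked, fir_direct) is hypothetical. It assumes real integer samples instead of cint16, circular indexing into the 32-sample buffer, and no shift or rounding on output. The four unrolled groups of the kernel are folded into one loop whose starting offset advances by 8 per group.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of the blocked schedule; not AI Engine code.
constexpr std::size_t kTaps = 32;   // FIR length
constexpr std::size_t kLanes = 8;   // outputs per group
constexpr std::size_t kBuf = 32;    // samples held in the buffer

using Coefs = std::array<long long, kTaps>;
using Buffer = std::array<long long, kBuf>;

// Accumulate 8 lanes x 8 points, reading the buffer circularly (assumed).
static void sliding_mac8(std::array<long long, kLanes>& acc, const Coefs& coef,
                         std::size_t coef_block, const Buffer& buf,
                         std::size_t data_start)
{
    for (std::size_t lane = 0; lane < kLanes; ++lane)
        for (std::size_t p = 0; p < 8; ++p)
            acc[lane] += coef[8 * coef_block + p] *
                         buf[(data_start + lane + p) % kBuf];
}

// Blocked FIR: each group of 8 outputs uses four sliding operations and
// refills the 8 oldest buffer slots between the second and third operation.
std::vector<long long> fir_blocked(const Coefs& coef, Buffer& delay_line,
                                   const std::vector<long long>& in)
{
    assert(in.size() % kBuf == 0);  // whole loop iterations of 32 samples
    Buffer buf = delay_line;
    std::vector<long long> out;
    out.reserve(in.size());
    for (std::size_t g = 0; g * kLanes < in.size(); ++g) {
        const std::size_t base = (g % 4) * kLanes;  // data_start of first op
        std::array<long long, kLanes> acc{};
        sliding_mac8(acc, coef, 0, buf, base);
        sliding_mac8(acc, coef, 1, buf, (base + 8) % kBuf);
        for (std::size_t k = 0; k < kLanes; ++k)    // models two buff.insert calls
            buf[base + k] = in[g * kLanes + k];
        sliding_mac8(acc, coef, 2, buf, (base + 16) % kBuf);
        sliding_mac8(acc, coef, 3, buf, (base + 24) % kBuf);
        out.insert(out.end(), acc.begin(), acc.end());
    }
    delay_line = buf;  // keep the history for the next call
    return out;
}

// Direct reference: y[m] = sum_j coef[j] * z[m + j],
// where z is the history followed by the input.
std::vector<long long> fir_direct(const Coefs& coef, const Buffer& history,
                                  const std::vector<long long>& in)
{
    std::vector<long long> z(history.begin(), history.end());
    z.insert(z.end(), in.begin(), in.end());
    std::vector<long long> out(in.size(), 0);
    for (std::size_t m = 0; m < in.size(); ++m)
        for (std::size_t j = 0; j < kTaps; ++j)
            out[m] += coef[j] * z[m + j];
    return out;
}
```

In this model the refill can happen any time after the first operation of a group and must happen before the fourth, because the fourth operation wraps into the freshly written slots. Placing it after the second operation mirrors the order used in the kernel.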

To evaluate the performance of the kernel, identify the initiation interval of the main loop as follows:

  1. Add the -v option to the AI Engine compiler to output a verbose report of kernel compilation.
  2. Open the kernel compilation log, for example, Work/aie/<COL_ROW>/<COL_ROW>.log.
  3. In the log, search for keywords such as do-loop to find the initiation interval of the loop.
    An example result follows:
    HW do-loop #2821 in ".../fir_32tap_vector.cc", line 21: (loop #3) :
    critical cycle of length 130 : ...
    minimum length due to resources: 128
    scheduling HW do-loop #2821
    (algo 2) -> # cycles: ......
    NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
    -> # cycles: .....183 (exceeds -k 110) -> no folding: 183
    -> HW do-loop #2821 in ".../Vitis/<VERSION>/aietools/include/adf/stream/me/stream_utils.h", line 870: (loop #3) : 183 cycles
    where:
    • The initiation interval of the loop is 183 cycles. Because each iteration produces 32 samples, a sample is produced roughly every 183/32 ≈ 5.7 cycles.
    • The message (exceeds -k 110) -> no folding indicates that the scheduler does not attempt software pipelining because the loop cycle count exceeds the limit of 110.
  4. To override the loop cycle limit, pass a user constraint such as --Xchess="fir_32tap_vector:backend.mist2.maxfoldk=200" to the AI Engine compiler.

    The example result is then as follows:

    (resume algo)	  -> after folding: 160  (folded over 1 iterations)
      -> HW do-loop #3518 in ".../Vitis/<VERSION>/aietools/include/adf/stream/me/stream_utils.h", line 277: (loop #3) : 160 cycles
    where the kernel now requires roughly 160/32 = 5 cycles to produce a sample.
    Note: The exact number of cycles may fluctuate slightly based on the specific compiler settings and the version of the compiler being used. However, the analysis techniques described in this section remain relevant and applicable regardless of these variations.
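
The per-sample figures quoted above follow from dividing the loop cycle count by the 32 samples produced per iteration. As a minimal sketch, the arithmetic can be written as the helpers below. Both function names are hypothetical, and the clock frequency is an input you supply yourself; it does not appear in the compiler log.

```cpp
#include <cassert>

// Hypothetical helper: average cycles per output sample for a loop that
// produces `samples_per_iteration` samples every `loop_cycles` cycles.
constexpr double cycles_per_sample(int loop_cycles, int samples_per_iteration)
{
    return static_cast<double>(loop_cycles) / samples_per_iteration;
}

// Hypothetical helper: samples per second at a given clock frequency.
// The clock frequency is an assumption supplied by the caller.
constexpr double samples_per_second(double clock_hz, int loop_cycles,
                                    int samples_per_iteration)
{
    return clock_hz / cycles_per_sample(loop_cycles, samples_per_iteration);
}

// The folded loop from the log: 160 cycles for 32 samples.
static_assert(cycles_per_sample(160, 32) == 5.0, "160/32 is exactly 5");
```

For the unfolded loop, cycles_per_sample(183, 32) evaluates to 183/32, a little under 6 cycles per sample.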