The AI Engine natively supports multiple lanes of MAC operations. For variations of FIR applications, the group of aie::sliding_mul* classes and functions introduced in Multiple Lanes Multiplication - sliding_mul can be used. This example uses the aie::sliding_mul and aie::sliding_mac functions with Lanes=8 and Points=8. Both the data and coefficient step sizes are 1, which is the default. For example, acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8); performs:

Lane 0: acc[0]=acc[0]+coe[1][0]*buff[8]+coe[1][1]*buff[9]+...+coe[1][7]*buff[15];
Lane 1: acc[1]=acc[1]+coe[1][0]*buff[9]+coe[1][1]*buff[10]+...+coe[1][7]*buff[16];
...
Lane 7: acc[7]=acc[7]+coe[1][0]*buff[15]+coe[1][1]*buff[16]+...+coe[1][7]*buff[22];
Notice that the data buff starts from a different index in each lane. The operation therefore needs Lanes + Points - 1 = 15 samples (buff[8] through buff[22]) to be ready before execution.
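For reference, the lane expansion above can be modeled with a plain scalar loop. The following host-side sketch is illustrative only: the function name ref_sliding_mac_8x8 is made up for this example, std::complex<float> stands in for cint16/cacc48, and output shifting and saturation are not modeled.

#include <complex>

// Scalar model of: acc = aie::sliding_mac<8,8>(acc, coef, 0, data, data_start);
// Lanes = 8 output samples, Points = 8 taps per call, data/coefficient steps = 1.
void ref_sliding_mac_8x8(std::complex<float> acc[8],
                         const std::complex<float> coef[8],
                         const std::complex<float> *data,
                         int data_start)
{
    for (int lane = 0; lane < 8; ++lane)
        for (int tap = 0; tap < 8; ++tap)
            acc[lane] += coef[tap] * data[data_start + lane + tap];
}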
Because the FIR has 32 taps, calculating eight lanes of output requires one aie::sliding_mul<8,8> operation followed by three aie::sliding_mac<8,8> operations. The data buffer is updated from the stream port by buff.insert().
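For example, the first two insert calls in the kernel below refresh the first eight samples of the buffer. The slice ranges in the comments are inferred from the sub-vector insert semantics of aie::vector and are shown here only for illustration:

buff.insert(0,readincr_v<4>(sig_in)); // overwrite buff[0..3] with four new stream samples
buff.insert(1,readincr_v<4>(sig_in)); // overwrite buff[4..7]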
The vectorized kernel code is as follows:
//keep margin data between different executions of graph
static aie::vector<cint16,32> delay_line;

alignas(aie::vector_decl_align) static cint16 eq_coef[32]={{1,2},{3,4},...};

__attribute__((noinline)) void fir_32tap_vector(input_stream<cint16> * sig_in, output_stream<cint16> * sig_out){
  const int LSIZE=(SAMPLES/32);
  aie::accum<cacc48,8> acc;
  const aie::vector<cint16,8> coe[4] = {aie::load_v<8>(eq_coef),aie::load_v<8>(eq_coef+8),aie::load_v<8>(eq_coef+16),aie::load_v<8>(eq_coef+24)};
  aie::vector<cint16,32> buff=delay_line;

  for(int i=0;i<LSIZE;i++){
    //1st 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);
    buff.insert(0,readincr_v<4>(sig_in));
    buff.insert(1,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //2nd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,16);
    buff.insert(2,readincr_v<4>(sig_in));
    buff.insert(3,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,0);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //3rd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,24);
    buff.insert(4,readincr_v<4>(sig_in));
    buff.insert(5,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,8);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //4th 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,0);
    buff.insert(6,readincr_v<4>(sig_in));
    buff.insert(7,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,16);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));
  }
  delay_line=buff;
}

void fir_32tap_vector_init()
{
  //initialize data
  for (int i=0;i<8;i++){
    aie::vector<int16,8> tmp=get_wss(0);
    delay_line.insert(i,tmp.cast_to<cint16>());
  }
};
- alignas(aie::vector_decl_align) can be used to ensure that data is aligned for vector loads and stores.
- Each iteration of the main loop computes multiple output samples, so the loop count is reduced.
- Data updates, calculations, and data writes are interleaved in the code. Which portion of the data buffer buff is read is controlled by the data_start argument of aie::sliding_mul (see the annotated excerpt after this list).
- For more information about the supported data types and lane numbers for aie::sliding_mul, see the AI Engine API User Guide (UG1529).
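As an illustration of how data_start selects the window, the following is an annotated excerpt of the first output group from the kernel above. The tap-block comments are added here for explanation and describe lane 0 only:

// 32 taps are split into four blocks of 8; data_start advances the window by 8 samples per block.
acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);      // taps 0..7:   lane 0 reads buff[0..7]
acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);  // taps 8..15:  lane 0 reads buff[8..15]
acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16); // taps 16..23: lane 0 reads buff[16..23]
acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24); // taps 24..31: lane 0 reads buff[24..31]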
The initiation interval of the main loop should be identified. To locate it:
- Add the -v option to the AI Engine compiler to output a verbose report of kernel compilation.
- Open the kernel compilation log, for example, Work/aie/<COL_ROW>/<COL_ROW>.log.
- In the log, search for keywords such as do-loop to find the initiation interval of the loop. An example result follows:

HW do-loop #2821 in ".../fir_32tap_vector.cc", line 21: (loop #3) : critical cycle of length 130 : ... minimum length due to resources: 128
scheduling HW do-loop #2821
(algo 2) -> # cycles: ...... NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
-> # cycles: .....183 (exceeds -k 110) -> no folding: 183
-> HW do-loop #2821 in ".../Vitis/<VERSION>/aietools/include/adf/stream/me/stream_utils.h", line 870: (loop #3) : 183 cycles

where:
- The initiation interval of the loop is 183. This means that a sample is produced in roughly 183/32 ≈ 6 cycles.
- The message (exceeds -k 110) -> no folding indicates that the scheduler is not attempting software pipelining because the loop cycle count exceeds the 110-cycle limit.
- To override the loop cycle limit, add a user constraint, such as --Xchess="fir_32tap_vector:backend.mist2.maxfoldk=200", to the AI Engine compiler. The example result is then as follows:
(resume algo) -> after folding: 160 (folded over 1 iterations)
-> HW do-loop #3518 in ".../Vitis/<VERSION>/aietools/include/adf/stream/me/stream_utils.h", line 277: (loop #3) : 160 cycles
where the loop now requires roughly 160/32 = 5 cycles to produce a sample.
Note: The exact cycle counts may fluctuate slightly depending on the specific compiler settings and the compiler version being used. However, the analysis techniques described in this section remain applicable regardless of these variations.