AI Engine naturally supports multiple lanes of MAC operations. For
variations of FIR applications, the group of aie::sliding_mul*
classes and functions introduced in Multiple Lanes Multiplication - sliding_mul can be used.
The example in this section uses the aie::sliding_mul and aie::sliding_mac functions with Lanes=8 and Points=8. Both the data and coefficient step sizes are 1, which is the default. For example, acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8); performs:
Lane 0: acc[0]=acc[0]+coe[1][0]*buff[8]+coe[1][1]*buff[9]+...+coe[1][7]*buff[15];
Lane 1: acc[1]=acc[1]+coe[1][0]*buff[9]+coe[1][1]*buff[10]+...+coe[1][7]*buff[16];
...
Lane 7: acc[7]=acc[7]+coe[1][0]*buff[15]+coe[1][1]*buff[16]+...+coe[1][7]*buff[22];
Notice that the data buff starts from a different index in each lane. The operation therefore requires more than 8 samples (the 15 samples from buff[8] to buff[22]) to be ready before execution.
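The lane indexing described above can be checked with a plain scalar loop. The sketch below is ordinary C++ rather than AI Engine code: the function name sliding_mac_8x8_ref is hypothetical, real-valued int samples stand in for cint16, and the model assumes unit step sizes and a coefficient start of 0, as in the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for a sliding multiply-accumulate with
// Lanes=8, Points=8 and unit step sizes. Plain ints stand in for cint16.
std::array<long, 8> sliding_mac_8x8_ref(std::array<long, 8> acc,
                                        const std::array<int, 8>& coeff,
                                        const std::vector<int>& buff,
                                        std::size_t data_start) {
    // The last lane reads up to buff[data_start + 7 + 7], so 15 samples are needed.
    assert(data_start + 14 < buff.size());
    for (std::size_t lane = 0; lane < 8; ++lane) {
        for (std::size_t point = 0; point < 8; ++point) {
            // Each lane starts one sample later than the previous lane.
            acc[lane] += static_cast<long>(coeff[point]) * buff[data_start + lane + point];
        }
    }
    return acc;
}
```

Calling this model with data_start set to 8 touches buff[8] through buff[22], which matches the range listed for lanes 0 to 7.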
Since it has 32 taps, the FIR requires one aie::sliding_mul<8,8> operation and three aie::sliding_mac<8,8> operations to calculate eight lanes of output. The data buffer is updated from the stream port using buff.insert.
The vectorized kernel code is as follows:
//keep margin data between different executions of graph
static aie::vector<cint16,32> delay_line;

alignas(aie::vector_decl_align) static cint16 eq_coef[32]={{1,2},{3,4},...};

__attribute__((noinline)) void fir_32tap_vector(input_stream<cint16> * sig_in, output_stream<cint16> * sig_out){
  const int LSIZE=(SAMPLES/32);
  aie::accum<cacc48,8> acc;
  const aie::vector<cint16,8> coe[4] = {aie::load_v<8>(eq_coef),aie::load_v<8>(eq_coef+8),aie::load_v<8>(eq_coef+16),aie::load_v<8>(eq_coef+24)};
  aie::vector<cint16,32> buff=delay_line;
  for(int i=0;i<LSIZE;i++){
    //1st 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);
    buff.insert(0,readincr_v<4>(sig_in));
    buff.insert(1,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //2nd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,16);
    buff.insert(2,readincr_v<4>(sig_in));
    buff.insert(3,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,0);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //3rd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,24);
    buff.insert(4,readincr_v<4>(sig_in));
    buff.insert(5,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,8);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //4th 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,0);
    buff.insert(6,readincr_v<4>(sig_in));
    buff.insert(7,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,16);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));
  }
  delay_line=buff;
}

void fir_32tap_vector_init()
{
  //initialize data
  for (int i=0;i<8;i++){
    aie::vector<int16,8> tmp=get_wss(0);
    delay_line.insert(i,tmp.cast_to<cint16>());
  }
};
- alignas(aie::vector_decl_align) can be used to ensure data is aligned for vector load and store.
- Each iteration of the main loop computes multiple samples. Consequently, the loop count is reduced.
- Data update, calculation, and data write are interleaved in the code. The data_start argument of aie::sliding_mul determines which portion of the data buffer buff is read.
- For more information about supported data types and lane numbers for aie::sliding_mul, see AI Engine API User Guide (UG1529).
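The interleaving of buffer updates and data_start values can be hard to follow from the kernel code alone, so a scalar model may help. The sketch below is plain C++, not AI Engine code, and rests on several assumptions: the name fir32_circular_model is hypothetical, plain int samples stand in for cint16, no shift or saturation is applied, and data indices are assumed to wrap modulo the 32-element buffer, which the data_start values in the kernel appear to rely on.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of the four-phase kernel loop. Plain ints stand
// in for cint16, and no shift or saturation is applied.
std::vector<long> fir32_circular_model(const std::array<int, 32>& coeff,
                                       std::array<int, 32> buff,  // initial delay line
                                       const std::vector<int>& input) {
    assert(input.size() % 32 == 0);  // one loop iteration consumes 32 samples
    std::vector<long> out;
    std::size_t next = 0;  // index of the next unread input sample

    // One 8-lane, 8-point sliding MAC using coefficient group `group`.
    // Data indices wrap modulo 32 (assumption about the hardware behavior).
    auto mac = [&](std::array<long, 8>& acc, std::size_t group, std::size_t start) {
        for (std::size_t lane = 0; lane < 8; ++lane)
            for (std::size_t p = 0; p < 8; ++p)
                acc[lane] += static_cast<long>(coeff[8 * group + p]) *
                             buff[(start + lane + p) % 32];
    };

    for (std::size_t iter = 0; iter < input.size() / 32; ++iter) {
        for (std::size_t phase = 0; phase < 4; ++phase) {
            const std::size_t base = 8 * phase;  // data_start of the first operation
            std::array<long, 8> acc{};
            mac(acc, 0, base);
            mac(acc, 1, base + 8);
            // The eight oldest samples are no longer needed: overwrite them.
            for (std::size_t k = 0; k < 8; ++k) buff[base + k] = input[next++];
            mac(acc, 2, base + 16);
            mac(acc, 3, base + 24);
            out.insert(out.end(), acc.begin(), acc.end());
        }
    }
    return out;
}
```

Under these assumptions, output m equals the sum over k of coeff[k] * x[m+k], where x is the initial delay line followed by the input stream. The first two operations of each phase read only old data, which is why the buffer can be refreshed between the second and third operations.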
The initiation interval of the main loop should be identified. To locate the initiation interval of the loop:
- Add the -v option to the AI Engine compiler to output a verbose report of kernel compilation.
- Open the kernel compilation log, for example, Work/aie/<COL_ROW>/<COL_ROW>.log.
- In the log, search for keywords, such as do-loop, to find the initiation interval of the loop.
An example result follows:
HW do-loop #2821 in ".../fir_32tap_vector.cc", line 21: (loop #3) :
critical cycle of length 130 : ...
minimum length due to resources: 128
scheduling HW do-loop #2821
(algo 2) -> # cycles: ......
NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
-> # cycles: .....183 (exceeds -k 110) -> no folding: 183
-> HW do-loop #2821 in ".../Vitis/2024.1/aietools/include/adf/stream/me/stream_utils.h", line 870: (loop #3) : 183 cycles
where:
- The initiation interval of the loop is 183. Because each loop iteration produces 32 samples, a sample is produced in roughly 183/32~=6 cycles.
- The message (exceeds -k 110) -> no folding indicates that the scheduler is not attempting software pipelining because the loop cycle count exceeds a limit.
- To override the loop cycle limit, add a user constraint, such as --Xchess="fir_32tap_vector:backend.mist2.maxfoldk=200", to the AI Engine compiler.
The example result is then as follows:
scheduling HW do-loop #2821
(algo 2) -> # cycles: ......
NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
-> # cycles: .....183
(modulo) -> # cycles: ... ok (required budget ratio: 2)
...
(resume algo) -> after folding: 161 (folded over 1 iterations)
-> HW do-loop #2821 in ".../Vitis/2024.1/aietools/include/adf/stream/me/stream_utils.h", line 870: (loop #3) : 161 cycles
After folding, the loop requires roughly 161/32~=5 cycles to produce a sample.
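The cycles-per-sample figures follow directly from the initiation interval and the 32 samples produced per loop iteration. A minimal sketch of that arithmetic follows; the helper name cycles_per_sample is hypothetical and is not part of any AI Engine tool or library.

```cpp
#include <cassert>

// Hypothetical helper: average cycles per output sample, given the loop
// initiation interval and the number of samples produced per iteration.
double cycles_per_sample(int initiation_interval, int samples_per_iteration) {
    assert(samples_per_iteration > 0);
    return static_cast<double>(initiation_interval) / samples_per_iteration;
}
```

With the values from the two log excerpts, cycles_per_sample(183, 32) is about 5.7 and cycles_per_sample(161, 32) is about 5.0, so folding removes roughly 12% of the cycles spent per sample.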