The input frame size is 256*384*2 bytes = 192 KB. One memtile is 512 KB, but an AIE-ML tile memory has only 64 KB. The input frame fits in a memtile, but not in an AIE-ML tile memory. The same frame data is used first to compute the “mean”, then the “deviation”, and finally the “normalization”.
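The sizing argument above can be checked at compile time. The following is a minimal sketch in plain C++ (no AIE headers). The constant and function names are hypothetical and chosen only for illustration; the sizes themselves are the ones quoted in the text.

```cpp
#include <cstddef>

// Sizes quoted in the text: a 256 x 384 frame of 2-byte samples.
constexpr std::size_t kCol = 256;
constexpr std::size_t kRow = 384;
constexpr std::size_t kBytesPerSample = 2;
constexpr std::size_t kMemtileBytes = 512 * 1024;  // one memtile
constexpr std::size_t kTileBytes = 64 * 1024;      // one AIE-ML tile memory

// Total bytes occupied by one input frame.
constexpr std::size_t frame_bytes() { return kCol * kRow * kBytesPerSample; }

// True when a buffer of `bytes` fits in a memory of `capacity` bytes.
constexpr bool fits(std::size_t bytes, std::size_t capacity) { return bytes <= capacity; }

static_assert(frame_bytes() == 192 * 1024, "frame is 192 KB");
static_assert(fits(frame_bytes(), kMemtileBytes), "frame fits in a memtile");
static_assert(!fits(frame_bytes(), kTileBytes), "frame does not fit in tile memory");
```

If the file compiles, all three claims hold for these numbers.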
Based on this analysis, a first design is constructed: Normalization Version 1
The data is transferred to a memtile and multicast to three kernels: mean, deviation, and norm. Kernel mean calculates the mean value and sends it to deviation. Kernel deviation calculates the deviation value and sends it, together with the mean value, to norm. Kernel norm generates the normalized values and sends them out.
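As a host-side reference for what the three kernels compute, here is a minimal sketch in plain C++. It assumes that "deviation" means the population standard deviation and that normalization is `(x - mean) / deviation`; the text does not spell out the formulas, so treat both as assumptions. The function names are hypothetical, and `float` stands in for `bfloat16`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pass 1: mean over the whole frame.
inline float mean_of(const std::vector<float>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (float v : x) sum += v;
    return static_cast<float>(sum / x.size());
}

// Pass 2: deviation, which needs the mean from pass 1.
// Assumption: population standard deviation.
inline float deviation_of(const std::vector<float>& x, float mean) {
    assert(!x.empty());
    double sq = 0.0;
    for (float v : x) sq += static_cast<double>(v - mean) * (v - mean);
    return static_cast<float>(std::sqrt(sq / x.size()));
}

// Pass 3: normalization, which needs both earlier results.
// Assumption: y = (x - mean) / deviation, with a non-zero deviation.
inline std::vector<float> normalize(const std::vector<float>& x, float mean, float dev) {
    assert(dev != 0.0f);
    std::vector<float> y;
    y.reserve(x.size());
    for (float v : x) y.push_back((v - mean) / dev);
    return y;
}
```

Each pass reads the full frame again and depends on the result of the previous pass, which is the dependency chain the graph has to respect.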
Look at Normalization Version 1 Graph Code:
It defines the frame size, COL=256 and ROW=384 (192 KB), and the kernel input buffer size, K_COL=256 and K_ROW=64 (32 KB, the maximum size for ping-pong buffers in a tile):
const int COL=256;
const int ROW=384;
const int K_COL=256;
const int K_ROW=64;
The memtile data is transferred to AIE tile memory over multiple iterations of the kernels. So, the repetition count of each kernel is ROW*COL/K_ROW/K_COL = 6:
repetition_count(k_mean)=ROW*COL/K_ROW/K_COL;
repetition_count(k_deviation)=ROW*COL/K_ROW/K_COL;
repetition_count(k_norm)=ROW*COL/K_ROW/K_COL;
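The repetition count and buffer sizes can be verified with a few constant expressions. This is a minimal sketch in plain C++ that reuses the constants from the graph code. The helper function names are hypothetical, and the 2-byte sample size is taken from the `bfloat16` buffer type.

```cpp
#include <cstddef>

// Frame and kernel-buffer dimensions quoted from the graph code.
constexpr int COL = 256, ROW = 384, K_COL = 256, K_ROW = 64;

// Bytes in one kernel input buffer (2 bytes per bfloat16 sample).
constexpr std::size_t kernel_buffer_bytes() {
    return static_cast<std::size_t>(K_COL) * K_ROW * 2;
}

// Ping-pong buffering keeps two copies of the kernel buffer in tile memory.
constexpr std::size_t ping_pong_bytes() { return 2 * kernel_buffer_bytes(); }

// Number of kernel iterations needed to consume one full frame.
constexpr int repetitions() { return ROW * COL / K_ROW / K_COL; }

static_assert(kernel_buffer_bytes() == 32 * 1024, "one kernel buffer is 32 KB");
static_assert(ping_pong_bytes() == 64 * 1024, "ping + pong take 64 KB");
static_assert(repetitions() == 6, "six iterations cover one frame");
```

Two 32 KB buffers add up to the 64 KB of a tile memory, which is why 32 KB is described as the maximum ping-pong buffer size.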
The write access and read access of the memtile are linear. For tiling parameter usage, refer to Tiling Parameters Specification.
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, 1);
write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxA.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
Look at the code of the kernel mean (Normalization Version 1 Mean Kernel Code). The kernel generates the mean value only after 6 iterations, so the output buffer of mean is defined as an asynchronous buffer, output_async_buffer. __attribute__((noinline)) is added to the kernel function to improve debuggability.
template<int COL, int ROW, int REPEAT>
__attribute__((noinline)) void mean(input_buffer<bfloat16> & __restrict data, output_async_buffer<bfloat16> & __restrict out){
    ......
    if(iteration==REPEAT){
        out.acquire();
        bfloat16* pout=out.data();
        *pout=(bfloat16)(aie::reduce_add(acc.to_vector<float>()) / ROW / COL / REPEAT);
        out.release();
        ......
}
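The "accumulate over several invocations, emit once" pattern can be illustrated without the AIE API. The following is a minimal host-side sketch, not the tutorial's kernel. `MeanSketch` and `push` are hypothetical names, `float` replaces `bfloat16`, a `bool` return value stands in for the `acquire()`/`release()` pair on the asynchronous output, and resetting the state after the result is emitted is an assumption (that part of the kernel is elided in the listing).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <int COL, int ROW, int REPEAT>
class MeanSketch {
public:
    // Consume one COL x ROW block. Returns true and writes *out only on the
    // REPEAT-th call, mirroring the "if(iteration==REPEAT)" branch.
    bool push(const std::vector<float>& block, float* out) {
        assert(block.size() == static_cast<std::size_t>(COL) * ROW);
        for (float v : block) acc_ += v;
        if (++iteration_ < REPEAT) return false;  // frame not complete yet
        *out = static_cast<float>(acc_ / ROW / COL / REPEAT);
        acc_ = 0.0;      // assumption: state is reset for the next frame
        iteration_ = 0;
        return true;
    }

private:
    double acc_ = 0.0;   // running sum across invocations
    int iteration_ = 0;  // blocks consumed in the current frame
};
```

For example, with `MeanSketch<2, 2, 3>`, the first two calls to `push` return `false` and only the third writes the mean of all 12 samples. This is why the output cannot be a regular buffer that is produced on every iteration.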
A similar concept applies to the kernels deviation (Normalization Version 1 Kernel Deviation Code) and norm (Normalization Version 1 Kernel Norm Code).
However, the design will hang. Hang detection is supported via multiple design flows. Each has its benefits:
x86 simulation is a quick flow. Run the following make command:
make x86sim
The log of the x86 simulation:
x86simulator: Detected deadlock
Deadlock diagnosis:
1. main() is waiting on kernel 'gr.k_mean' because Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
2. Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]' because Data unavailable from port 'gr.k_mean.in[0]'
3. Data unavailable from port 'gr.k_mean.in[0]' because Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]'
4. Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]' because Unable to write port 'gr.mtxA.out[0]'
5. Unable to write port 'gr.mtxA.out[0]' because Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]'
6. Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]' because Data unavailable from port 'gr.k_deviation.in[1]'
7. Data unavailable from port 'gr.k_deviation.in[1]' because Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
AIE simulation can visualize the stalls inside the graph. Run the following make command:
make aiesim
Refer to Lock Stall Analysis for the steps to analyze the root cause of the hang. The stalls of the kernels are highlighted as:
If the hang is not shown in simulation, but only in hardware, the AIE status report can be used for analysis. Run the following make command to build the SD card image:
make package TARGET=hw
Refer to AIE status report for the steps to analyze the root cause of the hang. The status in hardware looks like this:
From the above hang status in hardware, you can see how each kernel is stalled. The kernel mean cannot generate the “mean” because it never receives all 6 input buffers. The memtile cannot multicast all the data to the kernels, because deviation and norm can each store only 2 input buffers and then stall: they cannot release those buffers until the mean and deviation values arrive.
To break the dependency between the input data of the kernels, the design can use 3 different channels of the memtile. See the solution in the next version.
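The hang and the proposed remedy can be reproduced with a small toy model in plain C++. This is a simplified sketch, not a model of the actual hardware. It assumes each kernel has 2 input buffers (ping and pong), that a multicast transfer proceeds only when every receiver has a free buffer, and that a kernel starts consuming only after the upstream kernel has processed the whole frame. `SimResult` and `simulate` are hypothetical names.

```cpp
#include <array>
#include <cassert>

struct SimResult {
    bool finished;    // did norm process the whole frame?
    int mean_blocks;  // blocks consumed by mean before progress stopped
};

// Kernels: 0 = mean, 1 = deviation, 2 = norm.
inline SimResult simulate(bool independent_channels) {
    constexpr int kBlocks = 6;   // kernel iterations per frame
    constexpr int kBuffers = 2;  // ping + pong input buffers per kernel
    std::array<int, 3> sent{};      // blocks delivered to each kernel
    std::array<int, 3> consumed{};  // blocks processed by each kernel
    bool progress = true;
    while (progress) {
        progress = false;
        if (independent_channels) {
            // One memtile channel per kernel: each fills its own buffers.
            for (int k = 0; k < 3; ++k)
                if (sent[k] < kBlocks && sent[k] - consumed[k] < kBuffers) {
                    ++sent[k];
                    progress = true;
                }
        } else {
            // Single multicast: every receiver needs a free buffer.
            bool room = sent[0] < kBlocks;
            for (int k = 0; k < 3; ++k)
                room = room && (sent[k] - consumed[k] < kBuffers);
            if (room) {
                for (int& s : sent) ++s;
                progress = true;
            }
        }
        // A kernel consumes only after its upstream kernel finished the frame.
        for (int k = 0; k < 3; ++k) {
            assert(consumed[k] <= sent[k]);
            bool upstream_done = (k == 0) || (consumed[k - 1] == kBlocks);
            if (upstream_done && consumed[k] < sent[k]) {
                ++consumed[k];
                progress = true;
            }
        }
    }
    return {consumed[2] == kBlocks, consumed[0]};
}
```

In this model, `simulate(false)` stops after mean has consumed only 2 of the 6 blocks, which matches the stall pattern described above, while `simulate(true)` runs to completion.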