Version: Vitis 2024.1
This tutorial introduces design partitioning for AIE-ML devices. Using various performance analysis techniques, the design is optimized step by step, and the performance is verified in hardware at each optimization step.
IMPORTANT: Before beginning the tutorial, make sure you have installed the Vitis software platform 2024.1. The Vitis release includes all the embedded base platforms, including the VEK280 base platform that is used in this tutorial. In addition, ensure that you have downloaded the Common Images for Embedded Vitis Platforms from this link.
The ‘common image’ package contains a prebuilt Linux kernel and root file system that can be used with the AMD Versal™ board for embedded design development using the Vitis tools.
Before starting this tutorial, run the following steps:
1. Go to the directory where you have unzipped the Versal Common Image package.
2. In a Bash shell, run the /**Common Images Dir**/xilinx-versal-common-v2024.1/environment-setup-cortexa72-cortexa53-xilinx-linux script. This script sets up the `SDKTARGETSYSROOT` and `CXX` variables. If the script is not present, you must run /**Common Images Dir**/xilinx-versal-common-v2024.1/sdk.sh.
3. Set up your `ROOTFS` and `IMAGE` to point to the `rootfs.ext4` and `Image` files located in the /**Common Images Dir**/xilinx-versal-common-v2024.1 directory.
4. Set up your `PLATFORM_REPO_PATHS` environment variable to `$XILINX_VITIS/base_platforms`.
This tutorial targets the VEK280 board and the 2024.1 release.
Objectives
After completing this tutorial, you will be able to:
- Construct an AI Engine graph and use shared buffers (for AIE-ML memory tiles)
- Use simulation to perform hang analysis
- Use simulation and Vitis Analyzer to do profiling and performance analysis
- Learn the concepts of design partitioning and optimization for AIE-ML devices
Target Application Introduction
This tutorial targets z-score normalization, which scales the elements of a frame so that the output frame has mean $\mu=0$ and standard deviation $\sigma=1$.
Assume the input frame is a `COL * ROW` matrix (data is stored column first). For each element $x$ in a frame, the corresponding output element is computed as:
$$ x' = \frac{x-\mu}{\sigma} $$
Where:
$$ \mu = \sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1} x_{ij} \; / \; (ROW \cdot COL) $$
$$ \sigma = \sqrt{\sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1} (x_{ij}-\mu)^2 \; / \; (ROW \cdot COL - 1)} \approx \sqrt{\sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1} (x_{ij}-\mu)^2 \; / \; (ROW \cdot COL)} $$
For the designs in this tutorial, the following specifications are chosen (a plain-C++ reference of the computation is sketched after the list):

- COL=256
- ROW=384
- Data type: bfloat16
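Before looking at the AIE-ML implementation, it can help to pin down the computation with a host-side reference model. The following is a minimal sketch, not part of the tutorial's code; it uses `float` in place of bfloat16 and the approximate (biased) form of $\sigma$. Note that it makes three passes over the data, mirroring the three-kernel structure used below:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int COL = 256;
constexpr int ROW = 384;

// Reference z-score normalization: y = (x - mu) / sigma.
void normalize(const std::vector<float>& x, std::vector<float>& y) {
    const int n = COL * ROW;
    float sum = 0.f;
    for (int i = 0; i < n; ++i) sum += x[i];                      // pass 1: mean
    const float mu = sum / n;

    float sq = 0.f;
    for (int i = 0; i < n; ++i) sq += (x[i] - mu) * (x[i] - mu);  // pass 2: deviation
    const float sigma = std::sqrt(sq / n);                        // approximate form

    for (int i = 0; i < n; ++i) y[i] = (x[i] - mu) / sigma;       // pass 3: normalize
}

int main() {
    std::vector<float> x(COL * ROW), y(COL * ROW);
    for (int i = 0; i < COL * ROW; ++i) x[i] = float(i % 7) - 3.f; // arbitrary test data
    normalize(x, y);
    std::printf("y[0] = %f\n", y[0]);
}
```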
Steps - Version 1
The input frame size is 256*384*2 = 192 KB (bfloat16 occupies 2 bytes per element). One memtile is 512 KB, but an AIE-ML tile memory has only 64 KB, so the input frame can be placed into a memtile but not into an AIE-ML tile memory. Additionally, the same frame data is used three times: first to compute the mean, then the deviation, and finally the normalization.
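This sizing arithmetic can be sanity-checked at compile time; the following plain-C++ checks simply restate the numbers above:

```cpp
// bfloat16 occupies 2 bytes per element
constexpr int FRAME_BYTES = 256 * 384 * 2;   // = 196608
static_assert(FRAME_BYTES == 192 * 1024, "full frame is 192 KB");
static_assert(FRAME_BYTES <= 512 * 1024, "fits in one 512 KB memtile");
static_assert(FRAME_BYTES >   64 * 1024, "too large for 64 KB AIE-ML tile memory");
```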
Based on this analysis, a design is constructed: Normalization Version 1
The data is transferred to a memtile and multicast to three kernels: `mean`, `deviation`, and `norm`. Kernel `mean` calculates the mean value and sends it to `deviation`. Kernel `deviation` calculates the deviation value and sends it, together with the mean value, to `norm`. Kernel `norm` generates the normalized values and sends them out.
Look at Normalization Version 1 Graph Code:
It defines the frame size, COL=256 and ROW=384 (192 KB), and the kernel input buffer size, K_COL=256 and K_ROW=64 (32 KB, the maximum size for ping-pong buffers in a tile, since two 32 KB buffers fill the 64 KB tile memory):
```cpp
const int COL=256;
const int ROW=384;
const int K_COL=256;
const int K_ROW=64;
```
The memtile data is transferred to AIE tile memory via multiple iterations of the kernels. So, the repetition count of each kernel is ROW*COL/K_ROW/K_COL = 6:

```cpp
repetition_count(k_mean)      = ROW*COL/K_ROW/K_COL;
repetition_count(k_deviation) = ROW*COL/K_ROW/K_COL;
repetition_count(k_norm)      = ROW*COL/K_ROW/K_COL;
```
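The arithmetic behind that count can also be written as a compile-time check (plain C++, restating the constants above):

```cpp
constexpr int COL = 256, ROW = 384, K_COL = 256, K_ROW = 64;
static_assert(ROW * COL / K_ROW / K_COL == 6,
              "each kernel consumes the frame in 6 sub-frame iterations");
```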
The write access and read access of the memtile are linear. For details on how to use tiling parameters, refer to Tiling Parameters Specification.
```cpp
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, 1);
write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxA.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
```
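Taken together, the graph body might look roughly like the following skeleton. This is a hedged sketch based on the description above, not the tutorial's exact source: the class name, PLIO names/widths, file names, and `connect()` port indices are all assumptions.

```cpp
#include <adf.h>
#include "kernels.h"   // hypothetical header declaring mean/deviation/norm
using namespace adf;
// COL, ROW, K_COL, K_ROW as defined above

class NormGraphV1 : public graph {
public:
    input_plio  pl_in;
    output_plio pl_out;
    shared_buffer<bfloat16> mtxA;       // memtile buffer holding the full frame
    kernel k_mean, k_deviation, k_norm;

    NormGraphV1() {
        pl_in  = input_plio::create(plio_64_bits, "data/input.txt");
        pl_out = output_plio::create(plio_64_bits, "data/output.txt");

        k_mean      = kernel::create(mean<K_COL, K_ROW, ROW*COL/K_ROW/K_COL>);
        k_deviation = kernel::create(deviation<K_COL, K_ROW, ROW*COL/K_ROW/K_COL>);
        k_norm      = kernel::create(norm<K_COL, K_ROW, ROW*COL/K_ROW/K_COL>);
        source(k_mean)      = "mean.cc";        // hypothetical file names
        source(k_deviation) = "deviation.cc";
        source(k_norm)      = "norm.cc";
        runtime<ratio>(k_mean)      = 0.9;
        runtime<ratio>(k_deviation) = 0.9;
        runtime<ratio>(k_norm)      = 0.9;
        repetition_count(k_mean)      = ROW*COL/K_ROW/K_COL;   // = 6
        repetition_count(k_deviation) = ROW*COL/K_ROW/K_COL;
        repetition_count(k_norm)      = ROW*COL/K_ROW/K_COL;

        mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, 1);

        connect(pl_in.out[0], mtxA.in[0]);
        // one memtile output port multicast to all three kernels
        connect(mtxA.out[0], k_mean.in[0]);
        connect(mtxA.out[0], k_deviation.in[0]);
        connect(mtxA.out[0], k_norm.in[0]);
        connect(k_mean.out[0],      k_deviation.in[1]);  // mean -> deviation
        connect(k_deviation.out[0], k_norm.in[1]);       // mean & deviation -> norm
        connect(k_norm.out[0],      pl_out.in[0]);
    }
};
```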
Look at the kernel `mean` code (Normalization Version 1 Mean Kernel Code). The kernel generates the mean value only after six iterations, so the output buffer of `mean` is defined as an asynchronous buffer, `output_async_buffer`. `__attribute__((noinline))` is added to the kernel function to improve debuggability:

```cpp
template<int COL, int ROW, int REPEAT>
__attribute__((noinline)) void mean(input_buffer<bfloat16> & __restrict data,
        output_async_buffer<bfloat16> & __restrict out){
    ......
    if(iteration==REPEAT){
        out.acquire();
        bfloat16* pout=out.data();
        *pout=(bfloat16)(aie::reduce_add(acc.to_vector<float>()) / ROW / COL / REPEAT);
        out.release();
    ......
}
```
A similar concept applies to the kernels `deviation` (Normalization Version 1 Kernel Deviation Code) and `norm` (Normalization Version 1 Kernel Norm Code).
However, the design hangs. Hang detection is supported by multiple design flows, each with its own benefits:
X86 simulation is the quickest flow. Run the following make command:

```
make x86sim
```
The log of X86 simulation:

```
x86simulator: Detected deadlock
Deadlock diagnosis:
  1. main() is waiting on kernel 'gr.k_mean' because
       Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
  2. Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]' because
       Data unavailable from port 'gr.k_mean.in[0]'
  3. Data unavailable from port 'gr.k_mean.in[0]' because
       Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]'
  4. Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]' because
       Unable to write port 'gr.mtxA.out[0]'
  5. Unable to write port 'gr.mtxA.out[0]' because
       Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]'
  6. Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]' because
       Data unavailable from port 'gr.k_deviation.in[1]'
  7. Data unavailable from port 'gr.k_deviation.in[1]' because
       Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
```

The diagnosis describes a cycle: `k_mean` cannot produce the mean because it is starved of input data; the memtile cannot deliver that data because its multicast write to `k_deviation` is blocked; and `k_deviation` does not consume its input because it is still waiting for the mean value from `k_mean`.
AIE simulation can give a visualization of the stalls inside the graph. Run the following make command:

```
make aiesim
```
Refer to Lock Stall Analysis for steps to analyze the root cause of the hang. The stalls of the kernels are highlighted as: