AIE-ML Performance Analysis - 2024.1 English

Vitis Tutorials: AI Engine

Document ID
XD100
Release Date
2024-10-30
Version
2024.1 English

Version: Vitis 2024.1

This tutorial introduces design partitioning on an AIE-ML device. The design is optimized using various performance analysis techniques, and the performance is verified in hardware at each optimization step.

IMPORTANT: Before beginning the tutorial, make sure you have installed the Vitis software platform 2024.1. The Vitis release includes all the embedded base platforms, including the VEK280 base platform used in this tutorial. In addition, ensure that you have downloaded the Common Images for Embedded Vitis Platforms from this link.

The ‘common image’ package contains a prebuilt Linux kernel and root file system that can be used with the AMD Versal™ board for embedded design development using the Vitis tools.

Before starting this tutorial, perform the following steps:

  1. Go to the directory where you have unzipped the Versal Common Image package.

  2. In a Bash shell, run the /**Common Images Dir**/xilinx-versal-common-v2024.1/environment-setup-cortexa72-cortexa53-xilinx-linux script. This script sets up the SDKTARGETSYSROOT and CXX variables. If the script is not present, you must first run /**Common Images Dir**/xilinx-versal-common-v2024.1/sdk.sh.

  3. Set up your ROOTFS and IMAGE to point to the rootfs.ext4 and Image files located in the /**Common Images Dir**/xilinx-versal-common-v2024.1 directory.

  4. Set up your PLATFORM_REPO_PATHS environment variable to $XILINX_VITIS/base_platforms.

This tutorial targets the VEK280 board for the 2024.1 release.

Objectives

After completing this tutorial, you will be able to:

  • Construct an AI Engine graph and use shared buffers (for AIE-ML memory tiles)

  • Use simulation to perform hang analysis

  • Use simulation and Vitis Analyzer to perform profiling and performance analysis

  • Understand design partitioning and optimization for AIE-ML devices

Target Application Introduction

This tutorial targets z-score normalization, which scales the elements of a frame so that the output frame has mean $\mu=0$ and standard deviation $\sigma=1$.

Assume the input frame is a COL * ROW matrix (data is stored column first). Each element $x$ of the frame is normalized as:

$$ x^{\prime}={\frac{x-\mu}{\sigma}} $$

Where:

$$ {\mu}=\sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1}{x_{i,j}} / {(ROW*COL)} $$

$$\sigma=\sqrt{{\sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1}{{(x_{i,j}-\mu)}^2}} / {(ROW*COL-1)}} \approx \sqrt{\sum_{i=0}^{ROW-1}\sum_{j=0}^{COL-1}{{(x_{i,j}-\mu)}^2} / {(ROW*COL)}} $$

For the designs in this tutorial, the following specifications are chosen:

  • COL=256

  • ROW=384

  • Data type: bfloat16
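
As a quick reference for the math above, the z-score computation can be sketched as a plain C++ float model (the function name is illustrative; the actual kernels operate on bfloat16 slices on the AIE-ML array, and this sketch uses the population deviation, matching the approximation above):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference z-score normalization: out[i] = (x[i] - mean) / deviation.
// Divides the squared-error sum by N (population deviation), as in the
// approximate sigma formula above.
std::vector<float> zscore_normalize(const std::vector<float>& x) {
    const std::size_t n = x.size();

    float sum = 0.0f;
    for (float v : x) sum += v;
    const float mu = sum / static_cast<float>(n);

    float sq = 0.0f;
    for (float v : x) sq += (v - mu) * (v - mu);
    const float sigma = std::sqrt(sq / static_cast<float>(n));

    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = (x[i] - mu) / sigma;
    return out;
}
```

For example, normalizing {1, 2, 3, 4, 5} yields an output with mean 0 and deviation 1.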

Steps - Version 1

The input frame size is 256*384*2 bytes = 192 KB. One memory tile provides 512 KB, but an AIE-ML tile memory has only 64 KB, so the input frame fits into a memory tile but not into an AIE-ML tile memory. In addition, the same frame data is used three times: first to compute the mean, then the deviation, and finally the normalization.

Based on this analysis, the following design is constructed: Normalization Version 1

Version 1 Graph View

The data is transferred to a memory tile and multicast to three kernels: mean, deviation, and norm. Kernel mean calculates the mean value and sends it to deviation. Kernel deviation calculates the deviation value and sends it, together with the mean value, to norm. Kernel norm generates the normalized values and sends them out.

Look at Normalization Version 1 Graph Code:

  • It defines the frame sizes, COL=256 and ROW=384 (192 KB), and the kernel input buffer sizes, K_COL=256 and K_ROW=64 (32 KB, the maximum size for PING-PONG buffers in a tile):

    const int COL=256;
    const int ROW=384;
    const int K_COL=256;
    const int K_ROW=64;
    
  • The memory tile data is transferred to AIE-ML tile memory over multiple iterations of the kernels. So, the repetition count of each kernel is ROW*COL/K_ROW/K_COL = 6:

    repetition_count(k_mean)=ROW*COL/K_ROW/K_COL;
    repetition_count(k_deviation)=ROW*COL/K_ROW/K_COL;
    repetition_count(k_norm)=ROW*COL/K_ROW/K_COL;
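
The repetition count follows directly from the frame and kernel-buffer dimensions: 384*256 elements per frame divided by 64*256 elements per kernel invocation gives 6. A compile-time check of this arithmetic (using the tutorial's constants; the name REPEAT is illustrative):

```cpp
// Frame and kernel-buffer dimensions from the tutorial.
constexpr int COL = 256, ROW = 384;     // frame: 192 KB of bfloat16
constexpr int K_COL = 256, K_ROW = 64;  // kernel buffer: 32 KB of bfloat16

// Each kernel invocation consumes one K_COL x K_ROW slice, so each kernel
// must repeat frame_size / slice_size times to cover the whole frame.
constexpr int REPEAT = ROW * COL / K_ROW / K_COL;
static_assert(REPEAT == 6, "six kernel iterations per frame");
```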
    
  • The write access and read access of the memory tile are linear. For tiling parameter usage, refer to Tiling Parameters Specification.

    mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, 1);
    write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
    read_access(mtxA.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
    

    Look at the mean kernel code (Normalization Version 1 Mean Kernel Code):

  • The kernel generates the mean value only after 6 iterations, so the output buffer of mean is defined as an asynchronous buffer, output_async_buffer.

  • __attribute__((noinline)) is added to the kernel function to improve debuggability.

    template<int COL, int ROW, int REPEAT>
    __attribute__((noinline)) void mean(input_buffer<bfloat16> & __restrict data, output_async_buffer<bfloat16> & __restrict out){
    	......
    	if(iteration==REPEAT){
    		out.acquire();
    		bfloat16* pout=out.data();
    		*pout=(bfloat16)(aie::reduce_add(acc.to_vector<float>()) / ROW / COL / REPEAT);
    		out.release();
    	......
    }
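
The control flow of mean can be modeled in plain C++ as an accumulate-then-emit pattern (a host-side sketch under the assumption that one call corresponds to one kernel iteration; the output_async_buffer acquire/release is replaced by a plain write, and the class name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the mean kernel: accumulate partial sums over REPEAT
// calls, and emit the mean only on the last call (mirroring the
// acquire/release of the asynchronous output buffer in the AIE kernel).
template <int COL, int ROW, int REPEAT>
class MeanModel {
    float acc_ = 0.0f;
    int iteration_ = 0;

public:
    // Feeds one COL x ROW slice; returns true when the mean of the whole
    // frame (REPEAT slices) has been written to 'out'.
    bool push_slice(const std::vector<float>& slice, float& out) {
        for (float v : slice) acc_ += v;
        if (++iteration_ == REPEAT) {
            out = acc_ / (static_cast<float>(ROW) * COL * REPEAT);
            return true;
        }
        return false;
    }
};
```

As in the kernel above, the divisor ROW * COL * REPEAT is the total number of elements in the frame, since each slice holds ROW * COL elements.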
    

A similar concept applies to kernel deviation (Normalization Version 1 Kernel Deviation Code) and norm (Normalization Version 1 Kernel Norm Code).

However, this design hangs. Hang detection is supported in multiple design flows, each with its own benefits:

  1. x86 simulation is the quickest flow. Run the following make command:

    make x86sim
    

    The x86 simulation log reports the deadlock:

    x86simulator: Detected deadlock
    Deadlock diagnosis:
      1. main() is waiting on kernel 'gr.k_mean'
         because Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
      2. Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
         because Data unavailable from port 'gr.k_mean.in[0]'
      3. Data unavailable from port 'gr.k_mean.in[0]'
         because Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]'
      4. Node 'sharedBuf_i5_out0' is blocked while writing port 'gr.k_deviation.in[0]'
         because Unable to write port 'gr.mtxA.out[0]'
      5. Unable to write port 'gr.mtxA.out[0]'
         because Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]'
      6. Node 'gr.k_deviation' is blocked while reading port 'gr.k_mean.out[0]'
         because Data unavailable from port 'gr.k_deviation.in[1]'
      7. Data unavailable from port 'gr.k_deviation.in[1]'
         because Node 'gr.k_mean' is blocked while reading port 'gr.k_mean.in[0]'
    
  2. AIE simulation provides a visualization of the stalls inside the graph. Run the following make command:

    make aiesim
    

    Refer to Lock Stall Analysis for steps to analyze the root cause of the hang. The kernel stalls are highlighted as: