Steps – Version 3

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

In this version, the kernels “mean”, “deviation”, and “norm” are combined into a single kernel, and that kernel is replicated. Each kernel instance processes 1/6 of the data, and the instances are cascaded to pass partial accumulation results along. The last kernel computes the “mean” and “deviation” and multicasts them back to every kernel. Then, every kernel computes the “normalization” and sends its results to another memtile, where they are combined.
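As an illustration of that topology, the fragment below sketches how the replicated kernels might be wired up in ADF graph code. This is a hypothetical sketch, not the tutorial's actual graph code: the kernel names follow the profiling table later in this section, while the port indices are assumptions, and mtxA/mtxB are the shared buffers defined in the graph code shown below.

// Hypothetical connectivity sketch of Version 3 (port indices are illustrative).
// Six replicated kernels, chained by cascade for the partial accumulation,
// with the last kernel multicasting "mean"/"deviation" back over a stream.
kernel k[NUM];
k[0] = kernel::create(mean_dev_norm_first);
for (int i = 1; i < NUM - 1; i++)
    k[i] = kernel::create(mean_dev_norm_middle);
k[NUM - 1] = kernel::create(mean_dev_norm_last);

for (int i = 0; i < NUM; i++) {
    connect<>(mtxA.out[i], k[i].in[0]);              // 1/6 slice of the input frame
    connect<>(k[i].out[0], mtxB.in[i]);              // 1/6 slice of the normalized output
    connect<stream>(k[NUM - 1].out[1], k[i].in[1]);  // multicast "mean"/"deviation" back
}
for (int i = 0; i < NUM - 1; i++)
    connect<cascade>(k[i].out[1], k[i + 1].in[2]);   // pass partial accumulation along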

The design is in Normalization Version 3.

Version 3 Graph View

Look at Normalization Version 3 Graph Code:

  • The memtiles have six inputs or outputs. Each of these ports accesses 1/6 of the data via its offset settings:

const int NUM=6;
// mtxA: one write port (whole frame in), NUM read ports (one 1/6 slice per kernel).
// mtxB: NUM write ports (one 1/6 slice per kernel), one read port (whole frame out).
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, NUM);
mtxB = shared_buffer<bfloat16>::create({COL,ROW}, NUM, 1);
// The single PL-facing ports cover the full buffer.
write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxB.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
// Each kernel-facing port covers a K_COL x K_ROW slice selected by its row offset.
for(int i=0;i<NUM;i++){
	read_access(mtxA.out[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={K_COL,K_ROW}, .offset={0,K_ROW*i} });
	write_access(mtxB.in[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={K_COL,K_ROW}, .offset={0,K_ROW*i} });
}
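For example, assuming for illustration that COL=256 and ROW=384 (matching the 256*384*2-byte frame used in the throughput calculation below), with K_COL=COL and K_ROW=ROW/NUM=64, mtxA.out[i] reads the 256x64 tile starting at row offset 64*i, so the six kernel-facing ports tile the whole buffer with no overlap.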

Run the following command to perform the AIE simulation:

make aiesim

Open the simulation result with vitis_analyzer aiesimulator_output/default.aierun_summary, and then click Trace to open the trace view. In the trace view, the “Filter” button can group the kernels and related nets together to give a comprehensive view of the application execution:

Version 3 Trace View

Some observations from the above simulation result:

  • Kernel execution is in parallel. The last kernel has an additional summarization task, so it takes more time than the other kernels.

  • Transferring data from the memtile to the PL takes much longer than the kernels’ execution time. The same holds for transferring input data from the PL to the memtile.

  • The average graph completion time can be obtained as the interval from the 1st TLAST to the 2nd TLAST in the simulation result:

    T 111920 ns
    TLAST
    ......
    T 157808 ns
    TLAST
    
  • The graph throughput can then be computed from the frame size and that interval (a standalone check of this arithmetic follows this list):

    256*384*2 (Bytes) / (157808-111920) ns = 196608 Bytes / 45888 ns ≈ 4284.52 MB/s
    

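As a quick standalone check of that arithmetic (not part of the design; the byte count and timestamps are copied from the simulation results above):

#include <cstdio>

int main() {
    const double bytes    = 256.0 * 384.0 * 2.0; // one bfloat16 frame: COL*ROW*2 Bytes
    const double t_first  = 111920.0;            // ns at the 1st TLAST
    const double t_second = 157808.0;            // ns at the 2nd TLAST
    // 1 Byte/ns equals 1000 MB/s, hence the scaling factor.
    std::printf("Throughput: %.2f MB/s\n", bytes / (t_second - t_first) * 1000.0);
    return 0;
}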
The design can also be run through the hardware flow. The PL kernels are designed for performance testing purposes: they send and receive data at maximum throughput, without limiting AI Engine performance. To build for HW:

make package TARGET=hw

To run in HW:

./host.exe a.xclbin 9999

The result looks similar to:

Throughput of the graph:4137.26M Bytes/s

The following table summarizes the profiling results of the kernels and the graph:

Kernel or Graph          Cycles or Throughput
mean_dev_norm_first      12113 cycles
mean_dev_norm_middle     12106 cycles
mean_dev_norm_last       21104 cycles
Kernel Data Transfer     8192 cycles/iteration
Graph Throughput (sim)   4284.52 MB/s
Graph Throughput (HW)    4137.26 MB/s

From the above trace analysis and profiling results, the largest bottleneck is the data transfer to and from the PL. If the system allows more PL ports to be used, multiple PL ports can transfer data in parallel to improve system performance. See the next version for this optimization.