In this version, the kernels `mean`, `deviation`, and `norm` are combined into a single kernel, and that kernel is replicated. Each kernel instance handles 1/6 of the data, and the instances are cascaded to pass partial accumulation results downstream. The last kernel computes the final mean and deviation and multicasts them back to every kernel. Each kernel then computes the normalization of its portion and sends the result to another memtile, where the outputs are combined.
The design is in Normalization Version 3. Look at the Normalization Version 3 Graph Code:
The memtiles have six inputs or outputs. Each port accesses 1/6 of the data, selected by the `offset` setting:
```cpp
const int NUM = 6;
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, NUM);   // 1 input from PL, NUM outputs to kernels
mtxB = shared_buffer<bfloat16>::create({COL,ROW}, NUM, 1);   // NUM inputs from kernels, 1 output to PL
// The single PL-facing port reads or writes the whole buffer
write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxB.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
// Each kernel-facing port accesses a 1/6 slice, selected by its row offset
for(int i=0;i<NUM;i++){
    read_access(mtxA.out[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={K_COL,K_ROW}, .offset={0,K_ROW*i} });
    write_access(mtxB.in[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={K_COL,K_ROW}, .offset={0,K_ROW*i} });
}
```
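The code above only covers how the memtiles split the input and recombine the output. The cascade and multicast connections between the replicated kernels could be wired roughly as in the following sketch. This is an illustration, not the design's actual graph code: the port layout, the use of `connect<cascade>`, and the stream multicast from the last kernel are assumptions, and kernel source files and runtime settings are omitted. Refer to the Normalization Version 3 graph code for the real connections.

```cpp
// Assumed port layout (illustration only):
//   first : in[0]=data, in[1]=mean/dev broadcast        | out[0]=data, out[1]=cascade
//   middle: in[0]=data, in[1]=cascade, in[2]=broadcast  | out[0]=data, out[1]=cascade
//   last  : in[0]=data, in[1]=cascade                   | out[0]=data, out[1]=mean/dev broadcast
kernel k[NUM];
k[0]       = kernel::create(mean_dev_norm_first);
for (int i = 1; i < NUM - 1; i++) k[i] = kernel::create(mean_dev_norm_middle);
k[NUM - 1] = kernel::create(mean_dev_norm_last);

for (int i = 0; i < NUM; i++) {
    connect(mtxA.out[i], k[i].in[0]);    // each kernel reads its 1/6 slice of the input
    connect(k[i].out[0], mtxB.in[i]);    // normalized slice goes back to the output memtile
}
for (int i = 0; i < NUM - 1; i++) {
    connect<cascade>(k[i].out[1], k[i + 1].in[1]);   // forward partial accumulation results
}
connect(k[NUM - 1].out[1], k[0].in[1]);              // last kernel multicasts mean/deviation
for (int i = 1; i < NUM - 1; i++) {
    connect(k[NUM - 1].out[1], k[i].in[2]);          // ...to every other kernel
}
```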
Run the following command to perform AIE simulation:

```
make aiesim
```

Open the simulation result with `vitis_analyzer aiesimulator_output/default.aierun_summary`, and then click `Trace` to open the trace view. In the trace view, use the `Filter` button to group the kernels and some related nets together for a comprehensive view of the application's execution:
Some observations from the above run:

- Kernel execution happens in parallel. The last kernel has an additional summarization task, so it takes more time than the other kernels.
- Transferring data from the memtile to the PL takes much longer than the kernels' execution time. The same holds for transferring the PL input data to the memtile.
The average graph completion time can be obtained from the 1st `TLAST` to the 2nd `TLAST` in the simulation result:

```
T 111920 ns TLAST
......
T 157808 ns TLAST
```

The graph throughput can be computed as:

```
256 * 384 * 2 (Bytes) / (157808 - 111920) ns = 4284.52 MB/s
```
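As a cross-check of the arithmetic, the same number can be reproduced in a few lines of C++ (the frame size of 256 x 384 bfloat16 elements at 2 bytes each and the two `TLAST` timestamps are taken from the output above; this is only a unit-conversion helper, not part of the design):

```cpp
#include <cstdio>

int main() {
    const double bytes   = 256.0 * 384.0 * 2.0;   // one 256 x 384 frame of bfloat16 (2 bytes each)
    const double time_ns = 157808.0 - 111920.0;   // time between the two TLASTs
    // 1 byte/ns equals 1000 MB/s (MB = 10^6 bytes)
    std::printf("Throughput: %.2f MB/s\n", bytes / time_ns * 1000.0);   // prints 4284.52
    return 0;
}
```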
The design can also be run through the hardware flow. The PL kernels are designed for performance testing: they simply send and receive data at maximum throughput so that they do not affect the AI Engine performance measurement. To build for HW:

```
make package TARGET=hw
```

To run in HW:

```
./host.exe a.xclbin 9999
```
The result looks similar to:

```
Throughput of the graph:4137.26M Bytes/s
```
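The actual measurement is done by the host code shipped with the design. As a rough illustration of the timing idea only, a minimal XRT-based sketch is shown below. The graph name `gr`, the assumption of one 256 x 384 bfloat16 frame per iteration, and the omission of the PL data-mover setup are simplifications for illustration; the XRT graph header location can also differ between releases.

```cpp
// Minimal sketch, not the tutorial's host code: load the xclbin, run the AIE graph
// for a number of iterations, and derive MB/s from the wall-clock time.
// Note: starting the PL data movers is omitted here, but is required in the real design.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <xrt/xrt_device.h>
#include <xrt/xrt_graph.h>   // may live under experimental/ depending on the XRT release

int main(int argc, char* argv[]) {
    if (argc < 3) return 1;
    const int iterations = std::atoi(argv[2]);              // e.g. 9999

    xrt::device device(0);
    auto uuid = device.load_xclbin(argv[1]);                 // e.g. a.xclbin
    xrt::graph gr(device, uuid, "gr");                       // "gr" is an assumed graph name

    auto t0 = std::chrono::high_resolution_clock::now();
    gr.run(iterations);                                      // run the graph for N iterations
    gr.wait();                                               // block until the graph finishes
    auto t1 = std::chrono::high_resolution_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double bytes   = 256.0 * 384.0 * 2.0 * iterations; // assumed: one bfloat16 frame per iteration
    std::printf("Throughput of the graph:%.2fM Bytes/s\n", bytes / seconds / 1e6);
    return 0;
}
```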
The following table summarizes the profiling results of the kernels and the graph:
| Kernel or graph | Cycles or throughput |
|---|---|
| mean_dev_norm_first | 12113 (cycles) |
| mean_dev_norm_middle | 12106 (cycles) |
| mean_dev_norm_last | 21104 (cycles) |
| Kernel data transfer | 8192 (cycles/iteration) |
| Graph throughput (sim) | 4284.52 MB/s |
| Graph throughput (HW) | 4137.26 MB/s |
From the trace analysis and profiling results above, the largest bottleneck is the data transfer to and from the PL. If the system allows more PL ports to be used, multiple PL ports can transfer data in parallel to improve system performance. See the next version for this optimization.