Different channels of the memtile can transfer data to different kernels independently. Three channels of the memtile are used to resolve the hang issue, and an improved version, Normalization Version 2, is constructed.

Look at the Normalization Version 2 Graph Code:

The memtile has three outputs, and all of the access patterns are the same:
```cpp
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, 1, 3);
write_access(mtxA.in[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxA.out[0]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxA.out[1]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
read_access(mtxA.out[2]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW}, .offset={0,0} });
```
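For context, here is a minimal sketch of how the three memtile output channels could each feed a different kernel inside the graph. The graph input port `in1` and the kernel object names `k_mean`, `k_deviation`, and `k_norm` are illustrative assumptions, not the tutorial's exact identifiers:

```cpp
// One memtile input channel receives the full matrix; each of the three
// output channels reads the same buffer independently for a different
// kernel, avoiding the hang seen when the kernels contend for one channel.
connect(in1.out[0], mtxA.in[0]);          // graph input -> memtile
connect(mtxA.out[0], k_mean.in[0]);       // channel 0   -> mean kernel
connect(mtxA.out[1], k_deviation.in[0]);  // channel 1   -> deviation kernel
connect(mtxA.out[2], k_norm.in[0]);       // channel 2   -> norm kernel
```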
Run the following command to run the AIE simulation:

```
make aiesim
```
Open the simulation result with `vitis_analyzer aiesimulator_output/default.aierun_summary`, and then click Trace to open the trace view. In the trace view, the Filter button can be used to group the kernels and related nets together for a comprehensive view of the application execution:
Some observations from the above run:

- Point 3 shows that there are stalls between each kernel execution. This indicates that the data transfer to the kernel is slower than the kernel execution. Similarly, points 4 and 5 show that the data transfer to the kernels is slower than the kernels' execution.
- Points 3, 4, and 5 show that the kernels are executing in series.
- Points 2 and 3 show that the data transfer to a kernel can run in parallel with kernel execution. Similarly, points 5 and 6 show that the data transfer to the output can run in parallel with kernel execution.
Graph throughput can be defined as the amount of data transferred through the graph divided by the time needed to complete one iteration of the graph. Each output buffer sent from a kernel to a PLIO has a `TLAST` indication in the simulation output file, which can be used to determine the graph execution time. Since each graph iteration produces six kernel output buffers, the execution time of one iteration can be measured from the 1st `TLAST` to the 7th `TLAST` (see point 6 as an example). Following are example timestamps for the 1st `TLAST` and the 7th `TLAST`:
```
T 128198400 ps
TLAST
......
T 272822400 ps
TLAST
```
So, the graph throughput via simulation can be computed as:

```
256 * 384 * 2 (Bytes) / (272822400 - 128198400) ps = 1359.44 MB/s
```
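The same arithmetic can be checked with a small standalone program. The matrix size (256 x 384 bfloat16 values) and the two timestamps are taken directly from the numbers above; the snippet only illustrates the calculation:

```cpp
#include <cstdio>

int main() {
    const double bytes_per_iteration = 256.0 * 384.0 * 2.0;  // COL x ROW x sizeof(bfloat16)
    const double first_tlast_ps = 128198400.0;               // timestamp of the 1st TLAST
    const double last_tlast_ps  = 272822400.0;               // timestamp of the 7th TLAST

    const double seconds = (last_tlast_ps - first_tlast_ps) * 1e-12;
    // Prints ~1359.44 MB/s, matching the hand calculation above.
    std::printf("Graph throughput: %.2f MB/s\n", bytes_per_iteration / seconds / 1e6);
    return 0;
}
```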
The kernel execution time can be profiled in multiple ways. One way is to utilize the tile cycle counter in the kernel code (see Profiling Kernel Code); a sketch of this approach is shown below.
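A minimal sketch of the tile-counter approach, assuming a kernel built with the AIE API (the kernel name, signature, and `printf` reporting are illustrative and will differ from the tutorial's actual profiling code):

```cpp
#include <adf.h>
#include <aie_api/aie.hpp>

void mean(adf::input_buffer<bfloat16>& in, adf::output_buffer<bfloat16>& out)
{
    // Read the tile's 64-bit cycle counter before and after the kernel body.
    unsigned long long start = aie::tile::current().cycles();

    // ... kernel computation ...

    unsigned long long end = aie::tile::current().cycles();
    printf("mean kernel cycles: %llu\n", end - start);  // visible in the aiesimulator output
}
```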
Another way is to use the `--profile` option of the AIE simulation to get the function time:
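As an assumption about the flow (the tutorial's Makefile may wrap this differently), the simulator can be invoked directly with profiling enabled, given a design compiled into the default `./Work` directory:

```
aiesimulator --pkg-dir=./Work --profile
```

The generated profile reports then show per-function timing, including the metrics described in the tip below.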
TIP: `Total Function Time` includes only the function's execution time, but not its sub-functions. `Total Function Time + Descendants Time` includes the function's and its sub-functions' execution time. Both include stall time in function execution.
The design can also be run through the hardware flow. The PL kernels are designed for performance testing purposes: they just send and receive data at maximum throughput without affecting AI Engine performance. To build for hardware:

```
make package TARGET=hw
```

To run on hardware:

```
./host.exe a.xclbin 9999
```
The result might be similar to the following:

```
Throughput of the graph:1344.51M Bytes/s
```
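For reference, a host-side throughput figure like this is typically derived by timing the graph run and dividing the bytes transferred by the elapsed wall-clock time. A minimal sketch of that calculation is shown below; the `run_graph` callable stands in for the actual XRT/graph calls in the host code and is an assumption:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Times an arbitrary "run the graph" callable and prints the achieved throughput.
// 256 * 384 * 2 bytes per iteration matches the matrix size used above.
void report_throughput(const std::function<void()>& run_graph, int iterations)
{
    const double bytes_per_iteration = 256.0 * 384.0 * 2.0;

    auto t0 = std::chrono::high_resolution_clock::now();
    run_graph();  // e.g. run the graph for `iterations` iterations and wait for completion
    auto t1 = std::chrono::high_resolution_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("Throughput of the graph:%.2fM Bytes/s\n",
                bytes_per_iteration * iterations / seconds / 1e6);
}
```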
The following table summarizes the profiling results of the kernels and the graph:
| Kernel or graph | Cycles or throughput |
|---|---|
| mean | 2088 (cycles) |
| deviation | 4921 (cycles) |
| norm | 3296 (cycles) |
| Kernel Data Transfer | 8192 (cycles/iteration) |
| Graph Throughput (sim) | 1359.44 MB/s |
| Graph Throughput (HW) | 1344.51 MB/s |
Based on the above trace analysis and profiling results, the kernels can be placed into a single tile but replicated to improve the application performance. See how this optimization is done in the next version.