If the system allows, the data can be split into three portions and transferred through three ports. See the design in Normalization Version 4:
Look at the memtile settings in Normalization Version 4 Graph Code:
```cpp
const int PLIO_NUM=3;
// mtxA: PLIO_NUM input ports and NUM output ports; mtxB: NUM input ports and PLIO_NUM output ports
mtxA = shared_buffer<bfloat16>::create({COL,ROW}, PLIO_NUM, NUM);
mtxB = shared_buffer<bfloat16>::create({COL,ROW}, NUM, PLIO_NUM);
// Each of the PLIO_NUM ports accesses a COL x (ROW/PLIO_NUM) slab of the buffer, offset along the ROW dimension
for(int i=0;i<PLIO_NUM;i++){
    write_access(mtxA.in[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW/PLIO_NUM}, .offset={0,ROW/PLIO_NUM*i} });
    read_access(mtxB.out[i]) = tiling({.buffer_dimension={COL,ROW}, .tiling_dimension={COL,ROW/PLIO_NUM}, .offset={0,ROW/PLIO_NUM*i} });
}
```
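As a quick sanity check, the small standalone sketch below prints the slab that each port covers under this tiling. The values `COL=256` and `ROW=384` are assumptions taken from the throughput calculation later in this section, not from the graph code itself:

```cpp
// Standalone sketch (not part of the graph code): print the COL x (ROW/PLIO_NUM)
// slab that each PLIO port writes or reads. COL and ROW values are assumed.
#include <cstdio>

int main() {
    const int COL = 256, ROW = 384;  // assumed from the 256*384*2-byte transfer size below
    const int PLIO_NUM = 3;

    for (int i = 0; i < PLIO_NUM; i++) {
        std::printf("port %d: tile %d x %d at offset {0, %d}\n",
                    i, COL, ROW / PLIO_NUM, ROW / PLIO_NUM * i);
    }
    return 0;
}
```

Each port therefore moves one third of the matrix per graph iteration.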
Run the following command to run the AIE simulation:

```
make aiesim
```
Open the simulation result with `vitis_analyzer aiesimulator_output/default.aierun_summary`, and then click `Trace` to open the trace view. In the trace view, the `Filter` button can be used to group the kernels and some related nets together for a comprehensive view of the application execution:
Some observations from the above run:

- Kernel execution runs in parallel with the input and output data transfers.
- The last kernel takes more time than the other kernels.
The average graph completion time can be obtained from the interval between the first `TLAST` and the second `TLAST` in one of the simulation outputs:

```
T 58873600 ps TLAST
......
T 78547200 ps TLAST
```
The graph throughput can be computed as:

```
256 * 384 * 2 (Bytes) / (78547200 - 58873600) ps = 9993.49 MB/s
```
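For reference, here is a minimal C++ sketch of the same arithmetic, using the `TLAST` timestamps quoted above (MB here means 10^6 bytes):

```cpp
// Throughput = bytes moved per graph iteration / time between the two TLASTs.
#include <cstdio>

int main() {
    const double bytes    = 256.0 * 384.0 * 2.0;  // bfloat16 matrix: 2 bytes per element
    const double t_first  = 58873600e-12;         // first TLAST timestamp, in seconds
    const double t_second = 78547200e-12;         // second TLAST timestamp, in seconds
    const double mb_per_s = bytes / (t_second - t_first) / 1e6;
    std::printf("Graph throughput: %.2f MB/s\n", mb_per_s);  // prints ~9993.49
    return 0;
}
```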
The design can also run through the hardware flow. The PL kernels are designed for performance testing purposes: they simply send and receive data at maximum throughput without limiting AI Engine performance. To build for hardware:

```
make package TARGET=hw
```

To run in hardware:

```
./host.exe a.xclbin 9999
```
The result might be similar to the following:

```
Throughput of the graph:9728.82M Bytes/s
```
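The tutorial's host code is not reproduced here; the sketch below only illustrates the general idea behind such a measurement: time the run on the host and divide the total bytes moved by the elapsed time. `run_graph_iterations` is a hypothetical placeholder for the real XRT calls, and the bytes-per-iteration value is an assumption carried over from the simulation section:

```cpp
// Hypothetical host-side throughput measurement sketch (not the tutorial's host code).
#include <chrono>
#include <cstdio>

// Placeholder: a real host program would use XRT to feed the PL data movers
// and run the AI Engine graph for the requested number of iterations.
static void run_graph_iterations(int iterations) { (void)iterations; }

int main() {
    const int    iterations     = 9999;                 // same iteration count as the command line above
    const double bytes_per_iter = 256.0 * 384.0 * 2.0;  // assumed bytes moved per graph iteration

    auto t0 = std::chrono::high_resolution_clock::now();
    run_graph_iterations(iterations);                    // replace with the real graph run
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("Throughput of the graph:%.2fM Bytes/s\n",
                bytes_per_iter * iterations / seconds / 1e6);
    return 0;
}
```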
The following table summarizes the profiling results of the kernels and the graph:

| Kernel or graph | Cycles or throughput |
| --- | --- |
| mean_dev_norm_first | 9056 (cycles) |
| mean_dev_norm_middle | 9045 (cycles) |
| mean_dev_norm_last | 19113 (cycles) |
| Kernel data transfer | 8192 (cycles/iteration) |
| Graph throughput (AIE simulation) | 9993.49 MB/s |
| Graph throughput (hardware) | 9728.82 MB/s |
NOTE: Kernel performance is improved in version 4 because the default `xlopt` level is used. In previous versions, `--xlopt=0` was added to improve debuggability.