After running the AIE simulation with 64x64 matrices, we can look at the profiling results with:
make OPT=0 aieviz
This opens Vitis Analyzer with the run summary displayed. To open the profile view, click the last section, Profile.
There are 2 tiles which contain a kernel:
Column 10: the kernel output data type is int32
Column 20: the kernel output data type is int16
Let's start with the first one, which outputs int32 data. The Total Function Time tab shows the number of cycles needed to compute this matrix multiplication.
The entire function runs in 2092 cycles. To compute the usage efficiency of the vector processor, we need the following data:
number of multiplications to perform: 64 x 64 x 64
number of parallel int8 x int8 multiplications in the SIMD vector processor: 256
Efficiency = (64 x 64 x 64) / (2092 x 256) ≈ 0.49
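The same arithmetic can be checked with a short script. This is only a sketch: the function name `vector_efficiency` and its parameters are hypothetical helpers introduced for illustration, and the inputs are the figures quoted above (64x64x64 multiplications, 2092 cycles, 256 int8 x int8 multiplications per cycle).

```python
def vector_efficiency(m, k, n, cycles, macs_per_cycle):
    """Ratio of useful multiplications to the peak the vector unit could do.

    m, k, n        : matrix dimensions of an (m x k) by (k x n) multiply
    cycles         : measured kernel duration in cycles (from the profiler)
    macs_per_cycle : parallel multiplications the SIMD unit performs per cycle
    """
    total_macs = m * k * n                # multiplications actually required
    peak_macs = cycles * macs_per_cycle   # multiplications possible in that time
    return total_macs / peak_macs


# Figures quoted in the text for the int32-output kernel.
eff_int32 = vector_efficiency(64, 64, 64, cycles=2092, macs_per_cycle=256)
print(f"Efficiency: {eff_int32:.2f}")
```

Keeping the cycle count as a parameter makes it easy to plug in the value reported by the Total Function Time tab after each change to the kernel.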
This efficiency is not very high, and we will see how to improve it in the next part of this tutorial. In the meantime, we can look at the assembly code to understand why performance is at this level. The Profile Details tab gives access to this code.
The inner loop is run 360 times (4 Iterations), and we can count the VMUL and VMAC operations it contains: 8 VMUL/VMAC instructions out of 16 lines, which is consistent with the roughly 50% efficiency computed above.
An equivalent efficiency can be computed for the 16-bit version of the kernel, whose duration is 2089 cycles.
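One way to compare the two kernels is to derive the ideal cycle count first and divide it by each measured duration. The sketch below assumes, as the text does, that both kernels perform the same 64 x 64 x 64 multiplications at 256 per cycle; the variable names are hypothetical and the cycle counts are the ones reported above.

```python
MACS_PER_CYCLE = 256          # parallel int8 x int8 multiplications per cycle
TOTAL_MACS = 64 * 64 * 64     # multiplications in a 64x64 by 64x64 multiply

# Cycles the kernel would take if the vector unit were busy every cycle.
ideal_cycles = TOTAL_MACS // MACS_PER_CYCLE

# Measured durations quoted in the text, keyed by kernel output type.
measured_cycles = {"int32 output": 2092, "int16 output": 2089}

efficiency = {
    name: ideal_cycles / cycles for name, cycles in measured_cycles.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
```

Both kernels land at about the same ratio, which matches the observation that roughly half of the inner-loop instructions are VMUL/VMAC.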