The latency of the design is given by the performance counter value read from the DUT. The performance counter measures the time the DUT takes to complete one matrix multiplication, expressed as a number of clock cycles.
The following table shows the per-matrix latency for various matrix sizes (int16, 1x clocks):
| GeMM Configuration | Data Transfer Size | Latency (clocks) | Latency (us) | Matrices/s |
|---|---|---|---|---|
| 32x32x32 | 1024 | 34 | 0.097 | 10.29 x 10^6 |
| 64x64x64 | 4096 | 130 | 0.371 | 2.69 x 10^6 |
| 128x128x128 | 16384 | 1026 | 2.931 | 3.41 x 10^5 |
| 256x256x256 | 65536 | 8194 | 23.411 | 4.27 x 10^4 |
| 512x512x512 | 262144 | 65538 | 187.3 | 5.34 x 10^3 |
| 1024x1024x1024 | 1048576 | 524290 | 1497.8 | 6.67 x 10^2 |
Note: In hw_emu, due to a simulation problem, the expected data and the read data are off by one clock.
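
The latency in microseconds and the throughput in the table follow from the clock counts once a clock frequency is fixed. The sketch below shows that arithmetic. The 350 MHz value is an assumption inferred from the table itself (for example, 34 clocks / 0.097 us), not a figure stated in this document, and the names `ASSUMED_CLOCK_MHZ`, `latency_us`, and `matrices_per_second` are hypothetical helpers introduced only for illustration.

```python
# Assumed 1x clock frequency in MHz. This is inferred from the table
# (clocks / latency in us), not taken from the design documentation.
ASSUMED_CLOCK_MHZ = 350.0

# Perf counter values (clock cycles) from the table, keyed by GeMM configuration.
PERF_COUNTER_CLOCKS = {
    "32x32x32": 34,
    "64x64x64": 130,
    "128x128x128": 1026,
    "256x256x256": 8194,
    "512x512x512": 65538,
    "1024x1024x1024": 524290,
}


def latency_us(clocks, clock_mhz=ASSUMED_CLOCK_MHZ):
    """Convert a perf counter value in clock cycles to microseconds."""
    # Cycles divided by cycles-per-microsecond (MHz) gives microseconds.
    return clocks / clock_mhz


def matrices_per_second(clocks, clock_mhz=ASSUMED_CLOCK_MHZ):
    """Throughput, assuming one matrix completes every `clocks` cycles."""
    return clock_mhz * 1e6 / clocks


if __name__ == "__main__":
    for config, clocks in PERF_COUNTER_CLOCKS.items():
        print(f"{config}: {clocks} clocks, "
              f"{latency_us(clocks):.3f} us, "
              f"{matrices_per_second(clocks):.3e} matrices/s")
```

Under the 350 MHz assumption, the computed values agree with the table to within small rounding differences. If the actual 1x clock frequency of the design differs, pass it as `clock_mhz`.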