Throughput is measured in Tera Operations Per Second (TOPS). After the host app finishes writing matrices A and B, it drives the Start signal to the DUT. When the DUT is done, it drives the Done output. A performance counter increments on every clock cycle from Start to Done, so it counts the number of clock cycles for which the DUT is active.
For the 32x32x32 configuration, two 32x32x32 matrix multiplications are done. For each multiplication, 64K MAC operations are counted (2 x 32^3, i.e. the multiply and the add of each MAC are counted separately), giving a total of 64K * 2 = 128K MACs. If the performance counter reaches value X, then at the operating frequency of 350 MHz (period of 2.857 ns), the total time taken by the DUT = 2.857 x X ns.
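Thus TOPS = 128K MACs / (2.857 x X) ns, where the ratio in operations per nanosecond is divided by 1000 to convert it to tera-operations per second.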
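The calculation above can be sketched in a few lines of Python. This is only an illustration, not part of the test bench: the function name `tops`, the constant `CLOCK_HZ`, and the counter reading used in the example are hypothetical, and K is assumed to mean 1024.

```python
CLOCK_HZ = 350e6  # operating frequency stated in the text (period of about 2.857 ns)


def tops(total_ops: int, counter_cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Convert an operation count and a Start-to-Done cycle count into TOPS."""
    if counter_cycles <= 0:
        raise ValueError("counter_cycles must be positive")
    elapsed_s = counter_cycles / clock_hz  # time the DUT was active, in seconds
    return total_ops / elapsed_s / 1e12    # operations per second, scaled to tera


# Hypothetical counter reading X, for illustration only (not a measured value).
x = 4096
print(tops(128 * 1024, x))
```

With the made-up reading X = 4096, the 128K operations of the 32x32x32 configuration work out to roughly 0.0112 TOPS. A real run would substitute the value read back from the performance counter.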
For the rest of the configurations, one matrix multiplication is done.
| Configuration | MACs | TOPS Calculation |
|---|---|---|
| 64x64x64 | 512K | 512K MACs / (2.857 x X) ns |
| 128x128x128 | 4096K | 4096K MACs / (2.857 x X) ns |
| 256x256x256 | 32768K | 32768K MACs / (2.857 x X) ns |
| 512x512x512 | 262144K | 262144K MACs / (2.857 x X) ns |
| 1024x1024x1024 | 2097152K | 2097152K MACs / (2.857 x X) ns |
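
The counts in the table are numerically equal to 2 x N^3 for an NxNxN multiplication, with K = 1024. That reading is inferred from the numbers rather than stated in the text, so the short Python check below should be taken as a sketch under that assumption. The helper name `op_count` is hypothetical.

```python
K = 1024  # assumed meaning of "K" in the table


def op_count(n: int, multiplications: int = 1) -> int:
    """Operations for NxNxN multiplications, counting multiply and add separately."""
    return 2 * n ** 3 * multiplications


# Reproduce the count column for the single-multiplication configurations.
for n in (64, 128, 256, 512, 1024):
    print(f"{n}x{n}x{n}: {op_count(n) // K}K")
```

Calling `op_count(32, multiplications=2)` gives the 128K total used for the 32x32x32 configuration under the same assumption.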