The number of cycles required for data loading needs to be balanced with the number of compute cycles required by the kernel. This balance is a main consideration when determining window sizes for a design. When the two are balanced, loading data into one of the ping and pong buffers can be pipelined with kernel compute on the other. For designs with very high memory density, it makes sense to use smaller window sizes that still balance the kernel compute. Larger window sizes can lead to mapper failure.
The following table shows the number of cycles required for the matrix multiplication of two matrices with 16-bit data. Example 1 and Example 2 have different matrix sizes, but both have their compute and data loading balanced.
| Example | Matrix A Size | Matrix B Size | # of Multiplication Operations (MultOps) | # Cycles for Compute (32 ops/cycle) | # Cycles for Data Loading (32 bits/cycle) |
|---|---|---|---|---|---|
| Example 1 | 16x64 | 64x16 | 16384 | 512 (16384/32) | 512 (64x16x16/32) |
| Example 2 | 16x64 | 64x32 | 32768 | 1024 (32768/32) | 1024 (64x32x16/32) |
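
The arithmetic in the table can be reproduced with a short script, which may help when checking whether a candidate window size is balanced. This is a minimal sketch, not part of any tool flow: the function name `matmul_cycles` and its parameters are hypothetical, and it follows the table in counting only Matrix B toward data loading, with 32 multiplications per cycle and 32 bits loaded per cycle.

```python
def matmul_cycles(m, k, n, bits_per_element=16,
                  ops_per_cycle=32, load_bits_per_cycle=32):
    """Estimate compute and data-loading cycles for an (m x k) * (k x n) multiply.

    Hypothetical helper: the default rates mirror the table above.
    """
    # One multiplication per (row of A, column of B, inner index) triple.
    mult_ops = m * k * n
    # Ceiling division, so a partially filled cycle still counts as one cycle.
    compute_cycles = -(-mult_ops // ops_per_cycle)
    # As in the table, only Matrix B (k x n elements) is counted for loading.
    load_bits = k * n * bits_per_element
    load_cycles = -(-load_bits // load_bits_per_cycle)
    return mult_ops, compute_cycles, load_cycles


def is_balanced(m, k, n, **rates):
    """Return True when compute cycles equal data-loading cycles."""
    _, compute_cycles, load_cycles = matmul_cycles(m, k, n, **rates)
    return compute_cycles == load_cycles


if __name__ == "__main__":
    for name, (m, k, n) in {"Example 1": (16, 64, 16),
                            "Example 2": (16, 64, 32)}.items():
        ops, compute, load = matmul_cycles(m, k, n)
        print(f"{name}: MultOps={ops}, compute={compute}, load={load}")
```

Under these assumptions, the compute-to-load ratio reduces to `m / bits_per_element`, so both examples balance because Matrix A has 16 rows and the data is 16 bits wide. Changing the number of columns in Matrix B scales both cycle counts together.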