One of the main considerations when determining the window sizes for a design is balancing the number of cycles required for data loading against the number of compute cycles required by the kernel. When the two are balanced, the loading of the ping and pong buffers can be pipelined with the kernel compute. For designs with very high memory density, it makes sense to use smaller window sizes that still balance the kernel compute, because larger window sizes might lead to mapper failure.
The following table shows the number of cycles required to multiply two matrices of 16-bit data. Example 1 and Example 2 use different matrix sizes, but in both cases the compute and data loading cycles are balanced. Note that only the larger of the A and B matrices determines the data loading time, whereas the kernel compute time is determined by both sizes. Example 1 uses smaller window sizes than Example 2, yet in both examples the compute and data loading are balanced and can be pipelined.
Example | Matrix A Size | Matrix B Size | # of Multiplication Operations (MultOps) | # Cycles for Compute (32 ops/cycle) | # Cycles for Data Loading (32 bits/cycle) |
---|---|---|---|---|---|
Example 1 | 16x64 | 64x16 | 16384 | 512 (16384/32) | 512 (64x16x16/32) |
Example 2 | 16x64 | 64x32 | 32768 | 1024 (32768/32) | 1024 (64x32x16/32) |
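
The arithmetic behind the table can be reproduced with a short script, which may be useful for checking whether a candidate window size is balanced. This is a minimal sketch, not part of any tool flow: the function names (`compute_cycles`, `load_cycles`, `is_balanced`) are hypothetical, and the rates of 32 operations per cycle and 32 bits per cycle are taken from the table headers and should be adjusted for other data types or interfaces.

```python
def ceil_div(numerator, denominator):
    """Integer division rounded up, so partial cycles count as a full cycle."""
    return -(-numerator // denominator)


def compute_cycles(a_shape, b_shape, ops_per_cycle=32):
    """Cycles for an (M x K) by (K x N) multiply at ops_per_cycle MultOps per cycle."""
    m, k = a_shape
    k_b, n = b_shape
    if k != k_b:
        raise ValueError("inner dimensions of A and B must match")
    mult_ops = m * k * n  # one multiplication per (row, inner, column) triple
    return ceil_div(mult_ops, ops_per_cycle)


def load_cycles(a_shape, b_shape, bits_per_element=16, bits_per_cycle=32):
    """Cycles to load the larger of the two matrices, as stated in the text."""
    larger_elements = max(a_shape[0] * a_shape[1], b_shape[0] * b_shape[1])
    return ceil_div(larger_elements * bits_per_element, bits_per_cycle)


def is_balanced(a_shape, b_shape):
    """True when compute and data loading take the same number of cycles."""
    return compute_cycles(a_shape, b_shape) == load_cycles(a_shape, b_shape)


if __name__ == "__main__":
    for name, a, b in [("Example 1", (16, 64), (64, 16)),
                       ("Example 2", (16, 64), (64, 32))]:
        print(name, compute_cycles(a, b), load_cycles(a, b), is_balanced(a, b))
```

Running the script on the two table rows gives matching compute and data loading counts for each example, which is the condition the text describes as balanced.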