In this design, matrix multiplication is implemented using a DSP58 systolic array of size 32x32. This means that there are 32 DSP58 cascade chains, and each chain has 32 DSP58s. Thus, the 32x32 matrix is the basic matrix multiplication size. Larger matrices are broken down into submatrices of size 32x32.
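The decomposition of larger matrices into 32x32 submatrices can be sketched in software as follows. This is only an illustrative model, not the design's actual control logic: the function names (matmul_tile, get_tile, tiled_matmul) are hypothetical, the matrix dimensions are assumed to be multiples of 32, and summing the partial tile products into the output is an assumption, since the text does not say how partial results are combined in hardware.

```python
TILE = 32  # basic multiplication size from the text (32x32 DSP58 array)


def matmul_tile(a, b):
    """Plain product of two tiles; stands in for one pass through the array."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def get_tile(mat, r0, c0, tile):
    """Extract the tile x tile submatrix whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + tile] for row in mat[r0:r0 + tile]]


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (N x K) by b (K x M) one tile at a time.

    Assumption: N, K and M are all multiples of `tile`.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert n % tile == 0 and k % tile == 0 and m % tile == 0
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of the result
        for j0 in range(0, m, tile):      # tile column of the result
            for k0 in range(0, k, tile):  # accumulate along the shared dimension
                part = matmul_tile(get_tile(a, i0, k0, tile),
                                   get_tile(b, k0, j0, tile))
                for i in range(tile):
                    for j in range(tile):
                        c[i0 + i][j0 + j] += part[i][j]
    return c
```

For example, a 64x64 by 64x64 product uses eight 32x32 tile multiplications under this scheme: four output tiles, each the sum of two partial products.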
Basic 32x32 multiplication is performed as follows:
Matrix A row data moves upward along the DSP A-port cascade chain.
For the first 32 clock cycles, data is only shifted into the DSP chains.
After 32 clocks, row 0 of matrix A is populated in the first DSP cascade chain.
Row 1 is populated in the next cascade chain and so on.
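The loading phase described above can be modeled as a set of shift registers. The sketch below is a behavioral illustration only, under several assumptions not stated in the text: all chains are loaded in parallel, one element enters the bottom DSP of each chain per clock, and the elements of a row enter in index order (so the first element ends up in the top DSP). The function name load_rows and its clocks parameter are hypothetical.

```python
def load_rows(a, clocks=None):
    """Shift matrix A into the cascade chains, one element per chain per clock.

    chains[c][d] models the A register of DSP d in chain c, with d = 0 at the
    bottom of the chain. Assumption: every chain is loaded in parallel.
    """
    n = len(a)
    if clocks is None:
        clocks = n  # 32 clocks for the 32x32 array described in the text
    chains = [[None] * n for _ in range(n)]
    for clk in range(clocks):
        for c in range(n):
            # Each value moves up one DSP; a new element of row c enters DSP 0.
            chains[c] = [a[c][clk]] + chains[c][:-1]
    return chains


if __name__ == "__main__":
    N = 32
    A = [[100 * r + col for col in range(N)] for r in range(N)]
    chains = load_rows(A)
    # After N clocks, chain 0 holds row 0 and chain 1 holds row 1
    # (in reversed position order under the entry order assumed above).
    print(chains[0] == A[0][::-1], chains[1] == A[1][::-1])
```

Before the final clock, the top DSP of every chain is still empty in this model, which matches the statement that the first 32 clocks are spent only on shifting data in.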
The following figure illustrates this process.