Matrix A00 is first multiplied by Matrix B00, which is the basic 32x32 matrix multiplication. Over the first 96 clocks, each DSP chain produces 32 outputs, for a total of 1K outputs, which are partial sums for the final output. The system writes these partial sums to 64 partial sum block RAMs.
After 64 clocks, the first cascade chain completes the A00 B00 submatrix multiplication and starts on A00 B01 to calculate partial sums for the next column of the output matrix. Likewise, over the next 32 clocks, the other DSP cascade chains complete the A00 B00 multiplication and move on to the A00 B01 multiplication. In this way, Matrix A00 is multiplied by Matrix B00, B01, B02, through B0,31.
This completes the A00 submatrix multiplications. Next, the system reads the A01 submatrix of Matrix A and multiplies it with the submatrices of Matrix B. The resulting partial sums are added to the partial sums previously generated and stored. The system continues along the first row of Matrix A in this way, multiplying each submatrix with the submatrices of Matrix B. This continues for 32 iterations, and in the 32nd iteration the data is written to the output block RAM instead of the partial sum block RAM. This completes the computation of the first row of the output matrix.
The next step is to move to the next row of Matrix A and repeat all of these steps. After 32 such iterations, the 1Kx1Kx1K matrix multiplication is complete.
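The iteration order described above can be sketched in software as a blocked matrix multiplication with partial-sum accumulation. This is only an illustrative model, not the hardware implementation: the function names (`blocked_matmul`, `direct_matmul`), the list-of-lists matrix layout, and the small 8x8 test size are assumptions made for the example. It also assumes the standard blocked-multiplication pairing, in which the k-th submatrix in a row of Matrix A is multiplied with the k-th block row of Matrix B. Clock-level timing and the DSP cascade chains are not modeled.

```python
def direct_matmul(a, b, n):
    """Reference n x n multiplication, used only to check the blocked version."""
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def blocked_matmul(a, b, n, blk):
    """Multiply two n x n matrices in blk x blk submatrices (hypothetical sketch).

    Returns the output matrix and the number of submatrix multiplications.
    """
    nb = n // blk                          # submatrices per dimension (32 in the text)
    out = [[0] * n for _ in range(n)]      # stands in for the output block RAM
    block_multiplies = 0

    for bi in range(nb):                   # one block row of Matrix A at a time
        # Stands in for the partial sum block RAMs of the current output row.
        partial = [[0] * n for _ in range(blk)]
        for bk in range(nb):               # A submatrix bk along the current row of A
            last = (bk == nb - 1)          # final iteration for this output row
            for bj in range(nb):           # B submatrices along block row bk of B
                for r in range(blk):
                    for c in range(blk):
                        acc = 0
                        for t in range(blk):
                            acc += (a[bi * blk + r][bk * blk + t]
                                    * b[bk * blk + t][bj * blk + c])
                        total = partial[r][bj * blk + c] + acc
                        if last:
                            # Last iteration: write to the output, not the partial sums.
                            out[bi * blk + r][bj * blk + c] = total
                        else:
                            partial[r][bj * blk + c] = total
                block_multiplies += 1
    return out, block_multiplies


# Small demonstration: 8x8 matrices split into 2x2 submatrices (4 per dimension).
n, blk = 8, 2
a = [[(i * n + j) % 7 for j in range(n)] for i in range(n)]
b = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
result, count = blocked_matmul(a, b, n, blk)
print(result == direct_matmul(a, b, n), count)
```

With the sizes from the text (n = 1024, blk = 32), the same loop nest would perform 32 x 32 x 32 = 32,768 submatrix multiplications: 32 B submatrices for each A submatrix, 32 A submatrices per row, and 32 rows of Matrix A.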