Matrix Calculation Latency for Large Matrices

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

A 32x32 matrix calculation requires 96 clocks. However, the first cascade chain in the DSP58 array completes its computation after 64 clocks and can then start receiving data for the next submatrix. The previous and new submatrix calculations therefore overlap for 32 clocks, so each additional submatrix adds only 64 clocks. The total number of clocks required for a large matrix multiplication is 64 × (number of submatrices) + 32.
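The clock-count formula above can be checked with a short sketch. This is only an illustration of the arithmetic, not part of the tutorial design; the names `total_clocks`, `CLOCKS_PER_SUBMATRIX`, and `TAIL_CLOCKS` are hypothetical, and the number of submatrices is taken as an input because how it is counted depends on the matrix dimensions and tiling.

```python
# Illustrative constants taken from the text above (names are assumptions).
CLOCKS_PER_SUBMATRIX = 64  # clocks before the first cascade chain can accept the next submatrix
TAIL_CLOCKS = 32           # remaining clocks of the last 32x32 calculation (96 - 64)


def total_clocks(num_submatrices: int) -> int:
    """Return 64 * num_submatrices + 32, the clock count given in the text."""
    if num_submatrices < 1:
        raise ValueError("num_submatrices must be at least 1")
    return CLOCKS_PER_SUBMATRIX * num_submatrices + TAIL_CLOCKS


if __name__ == "__main__":
    for n in (1, 2, 8):
        print(f"{n} submatrices -> {total_clocks(n)} clocks")
```

With a single submatrix the formula reduces to the 96 clocks of one 32x32 calculation, which is a quick sanity check that the 32-clock overlap is accounted for once.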

In this design, the DSP clock operates at 700 MHz (a period of approximately 1.43 ns). The following figure shows the block diagram of the design.

Image of GEMM DSP Implementation Architecture
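As a rough sketch, the clock count can be converted to wall-clock latency using the 700 MHz DSP clock stated above. The function name `gemm_latency_ns` and its parameters are hypothetical, and the result covers only the DSP array clocks described in this section. Any data movement or other overhead outside that count is not modeled.

```python
DSP_CLOCK_HZ = 700e6  # DSP clock frequency stated in the text


def gemm_latency_ns(num_submatrices: int, clock_hz: float = DSP_CLOCK_HZ) -> float:
    """Estimate latency in nanoseconds for 64 * N + 32 clocks at clock_hz.

    Assumption: only the clock count from the text is modeled.
    """
    if num_submatrices < 1:
        raise ValueError("num_submatrices must be at least 1")
    clocks = 64 * num_submatrices + 32
    return clocks / clock_hz * 1e9


if __name__ == "__main__":
    for n in (1, 8):
        print(f"{n} submatrices -> {gemm_latency_ns(n):.1f} ns")
```

Under these assumptions, a single 32x32 calculation (96 clocks) takes about 137 ns at 700 MHz.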