Architecture - 2023.2 English

Vitis Libraries

Release Date
2023.2 English

From the algorithm, we know that the core of the computation is reading and accumulating two columns of data, then updating the corresponding two columns of matrices A and V. In this library, we implement this core module with the following architecture.


As can be seen from the architecture, steps 2-5 of the algorithm (for each pair (i, j)) are divided into three stages:

Stage 1:
  1. Read two columns of data of A into BRAM and accumulate them into \(b_{ii}\), \(b_{jj}\) and \(b_{ij}\).
  2. Preload two columns of data of matrix V into BRAM.
Stage 2:
  Calculate the SVD of the \(2 \times 2\) matrix.
Stage 3:
  1. Update two columns of data in matrix A.
  2. Update two columns of data in matrix V.
  3. Meanwhile, calculate convergence for the current pair (i, j).
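The three stages above can be sketched in plain C++. This is a software sketch with illustrative function names (`accumulate`, `rotate2x2`, `updateColumns`), not the library's HLS implementation:

```cpp
#include <cmath>
#include <vector>

// Stage 1: read columns i and j of A and accumulate the 2x2 Gram entries.
void accumulate(const std::vector<std::vector<double>>& A, int i, int j,
                double& bii, double& bjj, double& bij) {
    bii = bjj = bij = 0.0;
    for (std::size_t r = 0; r < A.size(); ++r) {
        bii += A[r][i] * A[r][i];
        bjj += A[r][j] * A[r][j];
        bij += A[r][i] * A[r][j];
    }
}

// Stage 2: Jacobi rotation (c, s) that zeroes the off-diagonal of
// the symmetric 2x2 matrix [[bii, bij], [bij, bjj]].
void rotate2x2(double bii, double bjj, double bij, double& c, double& s) {
    if (bij == 0.0) { c = 1.0; s = 0.0; return; }
    double tau = (bjj - bii) / (2.0 * bij);
    double t = (tau >= 0.0 ? 1.0 : -1.0) /
               (std::fabs(tau) + std::sqrt(1.0 + tau * tau));
    c = 1.0 / std::sqrt(1.0 + t * t);
    s = t * c;
}

// Stage 3: apply the rotation to columns i and j of a matrix (A or V).
void updateColumns(std::vector<std::vector<double>>& M, int i, int j,
                   double c, double s) {
    for (std::size_t r = 0; r < M.size(); ++r) {
        double mi = M[r][i], mj = M[r][j];
        M[r][i] = c * mi - s * mj;
        M[r][j] = s * mi + c * mj;
    }
}
```

In hardware, the two accumulations of stage 1 and the three updates of stage 3 become concurrent modules; the sequential calls here only show the data dependencies between stages.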

Since the operations on matrices A and V are independent, the two modules of stage 1 run in parallel. Likewise, the three modules of stage 3 run in parallel. The last module of stage 3 calculates convergence using the \(2 \times 2\) matrix data. According to the algorithm, this convergence computation belongs in the read-and-accumulate module of stage 1. However, it requires ~60 cycles, which is still significant even after partitioning matrix A by row. Therefore, this calculation is extracted into a submodule of stage 3.
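A typical convergence test for one pair (i, j) in one-sided Jacobi SVD compares the off-diagonal entry against the diagonal ones; the formula and the name `pairConverged` below are an assumed sketch, not the library's exact check:

```cpp
#include <cmath>

// Illustrative per-pair convergence test: the pair (i, j) is considered
// converged when |b_ij| is negligible relative to sqrt(b_ii * b_jj).
// The divide/square-root chain is what costs the ~60 cycles mentioned
// above, motivating its extraction into a separate stage-3 submodule.
bool pairConverged(double bii, double bjj, double bij, double eps) {
    return std::fabs(bij) < eps * std::sqrt(bii * bjj);
}
```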


Why is updating matrix V divided into two modules?

From the figure, we can see that there are two modules related to matrix V: preloading two columns of V into BRAM, and updating V. In our design, matrices A and V are both stored in URAM, and each URAM supports only 2 read/write ports. Since the SVD of the \(2 \times 2\) matrix accumulated from A needs 100+ cycles, we can preload two columns of V into BRAMs through the 2 URAM ports used as read ports, and then use both ports as write ports when updating the data in V.
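The port-scheduling argument can be made concrete with some cycle arithmetic. The numbers and names below (`Schedule`, `scheduleV`, a 2x2 SVD latency of 100 cycles) are illustrative assumptions, not measured figures from the library:

```cpp
// Assumed dual-port URAM schedule for V: while the 2x2 SVD runs, both
// ports read (preload two columns into BRAM); after it finishes, both
// ports write the rotated columns back.
struct Schedule {
    int preloadCycles;   // cycles to copy two columns into BRAM
    int updateCycles;    // cycles to write the rotated columns back
    bool preloadHidden;  // true if the preload fits inside the SVD latency
};

Schedule scheduleV(int rowsPerCU, int svdLatency) {
    // Two columns through two ports: one element of each column per cycle.
    int preload = rowsPerCU;
    int update  = rowsPerCU;
    return {preload, update, preload <= svdLatency};
}
```

For example, with 512 rows split over 16 CUs, each CU preloads 32 elements per column, comfortably hidden under a 100+ cycle 2x2 SVD.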

Besides, in order to speed up reading and updating matrix V, V is partitioned by row into NCU compute units (CUs). Each CU reads/writes its part of matrix V through 2 URAM ports.
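The row partitioning can be sketched as follows. The cyclic row distribution (CU k owning rows r with r % NCU == k) and the function name are assumptions for illustration; in hardware the outer loop over CUs runs fully in parallel:

```cpp
#include <vector>

// Sketch of updating V's columns i and j with V row-partitioned over
// NCU compute units. Each CU owns an interleaved slice of rows and
// accesses it through its own pair of URAM ports.
void updateVPartitioned(std::vector<std::vector<double>>& V, int i, int j,
                        double c, double s, int ncu) {
    for (int k = 0; k < ncu; ++k) {                 // NCU CUs, parallel in hardware
        for (std::size_t r = k; r < V.size(); r += ncu) {
            double vi = V[r][i], vj = V[r][j];
            V[r][i] = c * vi - s * vj;
            V[r][j] = s * vi + c * vj;
        }
    }
}
```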


Supported data size:

The maximum supported size of matrix A, templated by NRMAX and NCMAX, is 512 × 512. The partitioning factors MCU and NCU each support values up to 16.
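These limits can be enforced at compile time. The parameter names NRMAX, NCMAX, MCU, and NCU come from the text; the concrete values and the `static_assert` checks below are an illustrative configuration, not the library's own header:

```cpp
// Example configuration at the documented maximums.
constexpr int NRMAX = 512;  // max rows of matrix A
constexpr int NCMAX = 512;  // max columns of matrix A
constexpr int MCU   = 16;   // row-partition factor for matrix A
constexpr int NCU   = 16;   // row-partition factor for matrix V

static_assert(NRMAX <= 512 && NCMAX <= 512,
              "matrix A is supported up to 512 x 512");
static_assert(MCU <= 16 && NCU <= 16,
              "partition factors are supported up to 16");
```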