Double Precision SpMV Overview - 2024.2 English

Vitis Libraries

Document ID
XD160
Release Date
2024-11-29
Version
2024.2 English

As shown in the following figure, the double precision accelerator implemented on the AMD Alveo™ U280 card consists of a group of CUs (compute units, the instances of AMD Vitis™ kernels) connected via AXI streams. In this design, 16 (out of 32) HBM channels store the sparse matrix's NNZ (nonzero) values and their indices. Each of these HBM channels drives a dedicated computation path, involving selMultX and rowAcc, that performs the SpMV (sparse matrix-vector multiplication) operation on the portion of the sparse matrix data stored in that channel. In total, 16 SpMV operations run simultaneously on different parts of the sparse matrix data.

double precision SpMV architecture

The sparse matrix is partitioned on the host by Python code. The information about the partitions is stored in two HBM channels, namely the partition parameter store and the row block parameter store. The loadParX kernel loads the input dense vector X and the partition information from two HBM channels and passes the data to the fwdParParam and moveX kernels, which distribute the partition information and the X vector to the 16 selMultX CUs. The NNZ values and their indices are loaded by the loadNnz CU and distributed to the 16 selMultX CUs. The results of the selMultX CUs are accumulated in the rowAcc CU and assembled by the assembleY CU. Finally, the storeY CU stores the result vector Y to an HBM channel.
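The data flow described above can be mimicked in software to make the role of each stage concrete. The following is a minimal sketch, not the kernel implementation: the function names only echo the CU names, the round-robin split in `split_nnz` is a placeholder assumption rather than the partitioning strategy used by the host code, and the `(row, col, value)` triple format is chosen purely for illustration.

```python
from collections import defaultdict

NUM_CHANNELS = 16  # number of NNZ-holding HBM channels stated in the text


def split_nnz(entries, num_channels=NUM_CHANNELS):
    """Deal (row, col, value) triples round-robin to channels.

    Assumption for illustration only; the real host code uses its own
    partitioning strategy.
    """
    return [entries[c::num_channels] for c in range(num_channels)]


def sel_mult_x(nnz_stream, x):
    """Software stand-in for a selMultX path: select x[col], multiply by value."""
    for row, col, value in nnz_stream:
        yield row, value * x[col]


def row_acc(product_stream):
    """Software stand-in for rowAcc: sum the products that share a row index."""
    sums = defaultdict(float)
    for row, product in product_stream:
        sums[row] += product
    return dict(sums)


def assemble_y(partials, num_rows):
    """Software stand-in for assembleY: merge per-path results into one vector."""
    y = [0.0] * num_rows
    for partial in partials:
        for row, value in partial.items():
            y[row] += value
    return y


def spmv(entries, x, num_rows, num_channels=NUM_CHANNELS):
    """Run every channel's path (sequentially here) and assemble Y."""
    chunks = split_nnz(entries, num_channels)
    partials = [row_acc(sel_mult_x(chunk, x)) for chunk in chunks]
    return assemble_y(partials, num_rows)


# 3x3 example matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
entries = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
x = [1.0, 2.0, 3.0]
print(spmv(entries, x, num_rows=3))  # [5.0, 6.0, 19.0]
```

In the hardware the 16 paths run concurrently and exchange data over AXI streams; the list comprehension in `spmv` processes them one after another and only models the arithmetic, not the timing.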

The highlights of this architecture are:

  • Using AXI streams to connect a number of CUs (24 CUs in this design) so that massive parallelism can be realized in hardware
  • Using free-running kernels to remove embedded loops and simplify the logic
  • Leveraging different device memories to reduce the memory access overhead and meet the computation paths’ data throughput requirements
  • Minimizing the SLR (Super Logic Region) crossing logic to achieve a higher clock rate

Although the hardware architecture above offers high computational power, it does not by itself guarantee high system-level performance. To achieve that, the sparse matrix data must be partitioned evenly across the HBM channels. The following paragraphs present the details of the matrix partitioning strategy implemented in the Python code.
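To see why even partitioning matters, note that the parallel paths finish only when the most heavily loaded channel finishes. The sketch below is an illustration under stated assumptions and does not reproduce the library's partitioning strategy: it assigns whole rows to channels, measures load as the NNZ count per channel, and compares a naive contiguous split with a simple greedy heuristic. All function names are hypothetical.

```python
import heapq


def channel_loads(row_nnz_counts, assignment, num_channels):
    """Total NNZ count per channel for a given row-to-channel assignment."""
    loads = [0] * num_channels
    for row, channel in enumerate(assignment):
        loads[channel] += row_nnz_counts[row]
    return loads


def imbalance(loads):
    """Busiest channel's load divided by the ideal even share (1.0 is perfect)."""
    total = sum(loads)
    if total == 0:
        return 1.0
    return max(loads) * len(loads) / total


def contiguous_rows(num_rows, num_channels):
    """Naive split: equal-sized blocks of consecutive rows, ignoring NNZ counts."""
    return [row * num_channels // num_rows for row in range(num_rows)]


def greedy_by_nnz(row_nnz_counts, num_channels):
    """Heaviest rows first, each placed on the currently least-loaded channel."""
    heap = [(0, channel) for channel in range(num_channels)]
    assignment = [0] * len(row_nnz_counts)
    order = sorted(range(len(row_nnz_counts)), key=lambda r: -row_nnz_counts[r])
    for row in order:
        load, channel = heapq.heappop(heap)
        assignment[row] = channel
        heapq.heappush(heap, (load + row_nnz_counts[row], channel))
    return assignment


counts = [8, 1, 1, 1, 1, 1, 1, 2]  # NNZs per row of a small skewed matrix
naive = channel_loads(counts, contiguous_rows(len(counts), 2), 2)
greedy = channel_loads(counts, greedy_by_nnz(counts, 2), 2)
print(naive, imbalance(naive))    # [11, 5] 1.375
print(greedy, imbalance(greedy))  # [8, 8] 1.0
```

With the naive split, one channel carries 11 of the 16 NNZs, so its path needs roughly 37.5% more work than an even share. A real partitioner for this design would also have to respect hardware constraints that this sketch ignores.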