As shown in the following figure, the CSCMV accelerator implemented on the AMD Alveo™ U280 card consists of a group of CUs (compute units, the instances of AMD Vitis™ kernels) connected via AXI streams. In this design, 16 (out of 32) HBM channels are used to store the NNZ values and row indices of a sparse matrix. Each HBM channel drives a dedicated computation path, involving xBarCol and cscRow, that performs the SpMV operation for the portion of the sparse matrix data stored in that HBM channel. In total, 16 SpMV operations are performed simultaneously on different parts of the sparse matrix data. Thanks to the CSC format storage of the sparse matrix, the input dense vector has a high degree of reusability. This reusability, addressed in the bufTransColVec and bufTransNnzCol CUs, together with the low device memory access overhead, addressed in the loadCol CU, provides sufficient data throughput to allow the 16 parallel computation paths to run at 300 MHz and achieve the highest performance. The highlights of this architecture are:
- Using AXI streams to connect a large number of CUs (37 CUs in this design) so that massive parallelism can be realized in hardware
- Leveraging different device memories to reduce the memory access overhead and meet the computation paths’ data throughput requirements
- Minimizing the SLR (Super Logic Region) crossing logic to achieve a higher clock rate
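The reuse of the dense input vector that the CSC format provides can be illustrated with a plain software reference. The following is a minimal sketch, not the accelerator's implementation: the function name csc_spmv and the array names col_ptr, row_idx, and values are assumptions chosen for illustration. It shows that each input vector entry is read once per column and then reused for every NNZ in that column.

```python
def csc_spmv(num_rows, col_ptr, row_idx, values, x):
    """Reference y = A * x for a matrix A stored in CSC format.

    col_ptr[c]..col_ptr[c + 1] is the range of NNZ entries in column c;
    row_idx[k] and values[k] are the row index and value of the k-th NNZ.
    All names here are illustrative assumptions, not library identifiers.
    """
    y = [0.0] * num_rows
    for col in range(len(col_ptr) - 1):
        x_col = x[col]  # read once, reused for every NNZ in this column
        for k in range(col_ptr[col], col_ptr[col + 1]):
            y[row_idx[k]] += values[k] * x_col
    return y


# 3x3 example matrix:
#   [[1, 0, 2],
#    [0, 3, 0],
#    [4, 0, 5]]
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [1.0, 4.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csc_spmv(3, col_ptr, row_idx, values, x)
print(y)
```

Because the inner loop only needs x[col] and a stream of (row index, value) pairs, the NNZ data can be read sequentially from memory while the vector entry stays buffered.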
Although the above hardware architecture offers high computation power, it alone does not guarantee high system-level performance. To achieve that, the sparse matrix data has to be partitioned evenly across the HBM channels. The following paragraphs present the details of the matrix partitioning strategy implemented in the software, the device memory layouts that facilitate the partition metadata decoding, the functionality of the CUs, and the steps for building and simulating the design with Vitis 2022.2.
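To make the idea of an even split concrete, here is a deliberately simplified sketch. It is not the partitioning strategy used by the software, which is described in the following paragraphs. It assumes the NNZ entries are given as (row, column, value) tuples and simply cuts them into contiguous chunks whose sizes differ by at most one, so that each hypothetical channel processes about the same amount of work. The helper names split_nnz_evenly and spmv_by_channel are invented for this illustration.

```python
def split_nnz_evenly(nnz, num_channels=16):
    """Return (start, end) index ranges whose sizes differ by at most one."""
    base, extra = divmod(nnz, num_channels)
    bounds = []
    start = 0
    for ch in range(num_channels):
        size = base + (1 if ch < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def spmv_by_channel(num_rows, entries, x, num_channels=16):
    """Compute y = A * x as a sum of per-channel partial results.

    entries is a list of (row, col, value) tuples; each channel handles
    one contiguous chunk and the partial vectors are added together.
    """
    y = [0.0] * num_rows
    for start, end in split_nnz_evenly(len(entries), num_channels):
        partial = [0.0] * num_rows
        for row, col, val in entries[start:end]:
            partial[row] += val * x[col]
        y = [a + b for a, b in zip(y, partial)]
    return y


# Same 3x3 matrix as a column-ordered entry list, split over 2 channels
# to keep the example small (the design described here uses 16).
entries = [(0, 0, 1.0), (2, 0, 4.0), (1, 1, 3.0), (0, 2, 2.0), (2, 2, 5.0)]
bounds = split_nnz_evenly(len(entries), num_channels=2)
y_parts = spmv_by_channel(3, entries, [1.0, 2.0, 3.0], num_channels=2)
print(bounds, y_parts)
```

With a balanced split, no single channel becomes the bottleneck, which is the property the partitioning step needs to preserve whatever its actual block layout is.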