As illustrated in the figure below, the matrix partitioning algorithm implemented in the software includes 3 levels of partitioning.
- Partition the matrix along the rows into row blocks. Each row block has fewer than 4K rows (configured by SPARSE_maxRows).
- Partition each row block along the columns into partitions. Each partition has fewer than 4K columns (configured by SPARSE_maxCols).
- Each partition is divided equally into 16 (configured by SPARSE_hbmChannels) parts, called channel partitions.
- The number of NNZs in each row of the channel partition is padded to a multiple of 32 to accommodate the double precision accumulation latency (8 cycles, with 4 double precision data entries processed per cycle by the selMultX CU, so 8 x 4 = 32).
- Data in each channel partition are stored in row-major order.
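
The partitioning and padding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the constants stand in for SPARSE_maxRows, SPARSE_maxCols and SPARSE_hbmChannels, the block size limits are treated as "at most", a partition is assumed to be split across channels by contiguous row ranges, and padding entries are assumed to be explicit zeros.

```python
from collections import defaultdict

# Hypothetical constants mirroring the values quoted in the text.
MAX_ROWS = 4096      # stands in for SPARSE_maxRows
MAX_COLS = 4096      # stands in for SPARSE_maxCols
HBM_CHANNELS = 16    # stands in for SPARSE_hbmChannels
NNZ_ALIGN = 32       # 8 accumulation cycles x 4 entries per cycle


def pad_to_multiple(n, align=NNZ_ALIGN):
    """Round n up to the nearest multiple of align."""
    return -(-n // align) * align


def partition(entries, max_rows=MAX_ROWS, max_cols=MAX_COLS,
              channels=HBM_CHANNELS):
    """Group (row, col, value) triples by (row block, partition, channel).

    Assumption: each partition is split across channels by contiguous
    row ranges; the source only says the split is "equal".
    """
    rows_per_channel = -(-max_rows // channels)  # ceiling division
    parts = defaultdict(lambda: defaultdict(list))
    for r, c, v in entries:
        row_block, local_row = divmod(r, max_rows)   # level 1: row blocks
        col_part, local_col = divmod(c, max_cols)    # level 2: partitions
        channel = local_row // rows_per_channel      # level 3: channels
        parts[(row_block, col_part, channel)][local_row].append((local_col, v))
    return parts


def padded_channel_partition(rows):
    """Emit one channel partition in row-major order, padding each row
    with zero entries (an assumption) up to a multiple of NNZ_ALIGN."""
    out = []
    for local_row in sorted(rows):
        row_entries = sorted(rows[local_row])
        pad = pad_to_multiple(len(row_entries)) - len(row_entries)
        out.extend((local_row, col, val) for col, val in row_entries)
        out.extend((local_row, 0, 0.0) for _ in range(pad))
    return out
```

For example, a row holding 2 non-zeros in a channel partition is emitted as 32 entries: the 2 original values followed by 30 zero padding entries.
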
Each time a selMultX CU is triggered, a channel partition is processed. Each computation path (16 in total) in the rowAcc CU processes all row blocks for a specific HBM channel.
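
To make the channel-to-path mapping concrete, the sketch below groups channel partitions by HBM channel and lists the row blocks each accumulation path would see. It is an illustration only: the key layout `(row_block, partition, channel)` and the function name `row_blocks_per_path` are assumptions, not part of the library.

```python
def row_blocks_per_path(partition_keys, channels=16):
    """Map each HBM channel index to the row blocks its path handles.

    partition_keys: iterable of hypothetical
    (row_block, partition, channel) tuples, one per channel partition.
    """
    paths = {ch: [] for ch in range(channels)}
    for row_block, _col_part, channel in sorted(partition_keys):
        # A path sees every row block that has data on its channel,
        # regardless of which column partition the data came from.
        if row_block not in paths[channel]:
            paths[channel].append(row_block)
    return paths


keys = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 3)]
paths = row_blocks_per_path(keys)
```

With these keys, the path for channel 0 handles row blocks 0 and 1, the path for channel 3 handles row block 0, and the remaining paths have no work.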