Sparse Matrix Multiplication - 2024.2 English - UG1603

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID
UG1603
Release Date
2024-11-28
Version
2024.2 English

A sparse matrix is a special type of matrix where the majority of its elements are zero. The sparsity of a matrix is determined by the ratio of zero elements to the total number of elements. AI Engine-ML provides hardware support for sparse matrices. For a matrix multiplication C=A*B, matrix B can be sparse. AI Engine API aie::mmul supports sparse matrix multiplication, and are described in detail in the Matrix Multiplication section in the AI Engine API User Guide (UG1529).

The sparse matrix multiplication requires the matrix sparsity to be a minimum of 50%. For data types that are smaller or equal to 8 bits, to be considered sparse, every four consecutive bytes require at least two zeros. For data type that are larger than 8 bits, every four consecutive data samples (for example, int16) require at least two zero data samples.

The sparse data can be compressed in 512-bit chunks and stored in data memory. The first 64 bits are mask bits. Each mask bit corresponds to one byte in the original data sequentially. If the byte is zero, the mask bit is 0. If the byte is non-zero, the corresponding bit is 1, and the byte is appended to the compressed data. After the 512 bits data is compressed, guard bits are added to make sure that the compressed data is 32-bit aligned. Following is an example showing how a sparse data is compressed:

Figure 1. Data Compression Example

The AI Engine API aie::sparse_vector_input_buffer_stream reads and decompresses the compressed data stream into aie::sparse_vector. The vector aie::sparse_vector can be used in aie::mmul to do sparse matrix multiplication. For more information about the sparse data format and API aie::sparse_vector_input_buffer_stream usage, see Memory in the AI Engine API User Guide (UG1529)

The following is an example code. This API assumes that the input data (sparseB) has already been compressed in the producer (for example, PL kernel):
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
#include <aie_api/utils.hpp>
using namespace adf;
const int M=2;
const int K=16;
const int N=8;

alignas(aie::vector_decl_align) const int16 dataA[32]={......};
template<int SIZE>
__attribute__ ((noinline)) void sparse_mmul(
   input_buffer<int8,extents<SIZE>> & __restrict sparseB,
   output_buffer<int16,extents<SIZE>> & __restrict out){

    auto outIter=aie::begin_vector<16>(out);
    using MMUL = aie::mmul<M, K, N, int16, int8>;
    aie::vector<int16,32> matA=aie::load_v<32>(dataA);

    //sparseB is the input data, which should already have been compressed
    auto vbs = aie::sparse_vector_input_buffer_stream<int8, 128>((int8*)sparseB.data());
    ......
    MMUL m;
    for(int i=0;i<MAX_ITER;i++){
      aie::sparse_vector<int8,128> matB = vbs.pop();

      //sparse matrix multiply
      m.mul(matA,matB);
      *outIter++=m.to_vector<int16>(0);
    }
}
Note: The sparse matrix is compressed or decompressed column-wise (column by column).