Matrix Multiplications - mmul - 2025.1 English - UG1603

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2025-05-29
Version: 2025.1 English

The AI Engine API encapsulates the matrix multiplication functionality in the aie::mmul class template. This class template is parameterized with the matrix multiplication shape (M x K x N), the data types, and, optionally, the requested accumulation precision. For the supported shapes, see Matrix Multiplication in the AI Engine API User Guide (UG1529).

It defines one function for the initial multiplication (mul) and one function for multiply-add (mac). aie::mmul objects can be initialized from vectors or accumulators so that they can be used in chained computations where partial results are sent over the cascade.

The resulting class defines a function that performs the multiplication and a data type for the result that can be converted to an accumulator/vector. The function interprets the input vectors as matrices as described by the shape parameters.

The following sample code computes C(2x64) = A(2x8) * B(8x64) using the 2x4x8 mode of mmul. One iteration of the loop computes C0(2x8) = A0(2x4) * B0(4x8) + A1(2x4) * B1(4x8), where A0 is the left half of A, A1 is the right half of A, B0 is the upper-left 4x8 submatrix of B, B1 is the lower-left 4x8 submatrix of B, and C0 is the leftmost 2x8 submatrix of C.

The data for all matrices is assumed to be stored row-major in memory. Matrix A is small enough to be read into a single vector, so its elements must be filtered into the A0 and A1 tiles that mmul expects. B0 and B1 are read one row (eight elements) at a time, and four rows are concatenated to form each mmul operand. The two rows of C0 are written to memory separately, so their vector indexes must be computed accordingly.

The matrix multiplication is illustrated as follows:

Figure 1. Matrix Multiplication Example

And the corresponding example code is as follows:

Note: This example shows usage of mmul, which is not intended for performance optimization.
#include <aie_api/aie.hpp>
#include "aie_api/utils.hpp"
//For element mmul
const int M=2;
const int K=4;
const int N=8;
//Total matrix sizes
const int rowA=2;
const int colA=8;
const int colB=64;
const int SHIFT_BITS=0;
//Derived parameters
const int HALF_B_NUM=8*64/2; //elements in the upper (or lower) four rows of B

using MMUL = aie::mmul<M, K, N, int16, int16>;
__attribute__((noinline)) void matmul_mmul(adf::input_buffer<int16>& __restrict data0, adf::input_buffer<int16>& __restrict data1, adf::output_buffer<int16>& __restrict out){
  auto pa=aie::begin_vector<MMUL::size_A*2>(data0);
  aie::vector<int16,MMUL::size_A*2> va=*pa;
  //select left half matrix of A into va0
  aie::vector<int16,MMUL::size_A> va0=aie::filter_even(va,4);
  //select right half matrix of A into va1
  aie::vector<int16,MMUL::size_A> va1=aie::filter_odd(va,4);
  auto pb0=aie::begin_vector<N>(data1);
  //Note that each unit increment of the vector pointer advances N(=8) elements
  auto pb1=pb0+HALF_B_NUM/N;
  aie::vector<int16,N> vb0_[4];
  aie::vector<int16,N> vb1_[4];
  aie::vector<int16,MMUL::size_C> vc;
  auto pc=aie::begin_vector<N>(out);
  LOOP_on_COL:for(int i=0;i<colB/N;i++)
  chess_prepare_for_pipelining
  {
    for(int j=0;j<4;j++){
      vb0_[j]=*pb0;
      pb0+=colB/N;
      vb1_[j]=*pb1;
      pb1+=colB/N;
    }
    MMUL m;
    m.mul(va0,aie::concat(vb0_[0],vb0_[1],vb0_[2],vb0_[3]));
    m.mac(va1,aie::concat(vb1_[0],vb1_[1],vb1_[2],vb1_[3]));
    vc=m.to_vector<int16>(SHIFT_BITS);//right shift SHIFT_BITS
    *pc=vc.extract<N>(0);
    //Note that each unit increment of the vector pointer advances N(=8) elements
    pc+=colB/N;
    *pc=vc.extract<N>(1);
    pc-=(colB/N-1);
    pb0-=(colB/N*4-1);
    pb1-=(colB/N*4-1);
  }
}

Matrix Multiplication in AI Engine-ML v2

Matrix multiplication is also supported on AIE-ML v2. The same data types can be used, but with different form factors. A float16 matrix can be multiplied with another float16 matrix or with a bfloat16 matrix.

Block floating-point matrix multiplications have constraints deriving from the data type:

  • The number of columns of the first matrix must be 16 to satisfy MX block size.
  • The number of rows of the second matrix must be 16 to satisfy MX block size.

This results in a matrix multiplication where the first matrix has 4 or 8 rows and 16 columns, and the second matrix is always 16 x 16. This aligns with the size of the accumulators (1024 or 2048 bits) as the resulting matrix form factor is either 4x16 or 8x16, which requires a 64 float accumulator (1024 bits) or a 128 float accumulator (2048 bits).

Block floating-point matrix multiplication always applies the transpose operator to the second operand, so each output element is computed as a block floating-point vector dot product.

The following example is a block floating-point matrix multiplication with form factor 4x16x16. The involved submatrices are 4x16 for the first operand (A) and 16x16 for the second operand (B):

Figure 2. Matrices Form Factor

To simplify the code, the matrices are stored in a specific order in memory.

Figure 3. Matrices Storage Order

Each sub-matrix is composed of a number of block floating-point vectors that are stored sequentially in memory:

Figure 4. Sub-Matrices Storage Order

The following code is not intended for optimization:


#include <aie_api/aie.hpp>
#include "aie_api/utils.hpp"
// Matrix A is matM x matK, Matrix B is matN x matK; compute C = A * transpose(B)
#define matM 8
#define matK 32
#define matN 48
// The selected form factor is 4x16x16
#define subM 4
#define subK 16
#define subN 16
// Buffer size in bytes
#define BufASize (matM*matK*18/16)
#define BufBSize (matN*matK*18/16)
#define BufCSize (matM*matN*18/16)

void MxMatMult(
    adf::input_buffer<uint8, adf::extents<BufASize>> & __restrict inA,
    adf::input_buffer<uint8, adf::extents<BufBSize>> & __restrict inB,
    adf::output_buffer<uint8, adf::extents<BufCSize>> & __restrict outC
)
{
    // Instantiate the matrix multiply
    aie::mmul<subM, subK, subN, mx9, mx9, accfloat> m;
    // Matrix A is stored in blocks of 4x16 row major
    // Matrix B is stored in blocks of 16x16 row major
    // Matrix C is stored in blocks of 4x16 row major
    aie::block_vector<mx9,64> va;
    aie::block_vector<mx9,256> vb;

    aie::block_vector_output_buffer_stream<mx9, 64> outc_bufs((mx9 *)outC.data());

    // Loop on C output matrix block result
    for(int row = 0; row<(matM/subM); row++)
    {
        // Matrix B blocks are stored so that this stream is consumed
        // sequentially over the whole column loop
        aie::block_vector_input_buffer_stream<mx9, 256> inb_bufs((mx9 *)inB.data());

        for (int col = 0; col < (matN / subN); col++)
        {
            // Block index for matrix A : row*(matK/subK); the blocks of this
            // row are re-read for every output block column
            aie::block_vector_input_buffer_stream<mx9, 64> ina_bufs((mx9 *)(inA.data() + row*(matK/subK) * subM * 18));

            ina_bufs >> va;
            inb_bufs >> vb;
            m.mul(va, aie::op_transpose(vb));

            // matK/subK blocks along K: one mul plus (matK/subK - 1) macs
            for (int inner = 1; inner < (matK / subK); inner++)
            {
                ina_bufs >> va;
                inb_bufs >> vb;
                m.mac(va, aie::op_transpose(vb));
            }

            aie::block_vector<mx9,64> vc = m.to_vector<mx9>();
            outc_bufs << vc;
        }
    }
}