Matrix Multiplications - mmul

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079

Release Date: 2022-05-25

Version: 2022.1 English

The AI Engine API encapsulates the matrix multiplication functionality in the aie::mmul class template. This class template is parametrized with the matrix multiplication shape (M*K*N), the data types and, optionally, the requested accumulation precision. For the supported shapes, see Matrix Multiplication.

It defines one function for the initial multiplication (mul) and one function for multiply-add (mac). aie::mmul objects can be initialized from vectors or accumulators so that they can be used in chained computations where partial results are sent over the cascade.

The instantiated class defines the functions that perform the multiplication and a result type that can be converted to an accumulator or a vector. These functions interpret the input vectors as matrices whose dimensions are given by the shape parameters.
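To make the mul/mac semantics concrete, the following is a minimal scalar sketch in plain C++ that can be compiled on a host. It is not the AI Engine API: RefMmul and ref_mmul_example are hypothetical names, the 64-bit accumulator is a simplification, and the row-major interpretation of the flat inputs is an assumption taken from how the sample in this section lays out its tiles.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in that mirrors the mul/mac call pattern of
// aie::mmul<M,K,N,int16,int16>. It does not model AI Engine accumulator widths.
template <std::size_t M, std::size_t K, std::size_t N>
struct RefMmul {
    std::array<std::int64_t, M * N> acc{};  // row-major M x N accumulator

    // acc = A * B, with A a row-major M x K matrix and B a row-major K x N matrix
    void mul(const std::array<std::int16_t, M * K>& a,
             const std::array<std::int16_t, K * N>& b) {
        acc.fill(0);
        mac(a, b);
    }

    // acc += A * B
    void mac(const std::array<std::int16_t, M * K>& a,
             const std::array<std::int16_t, K * N>& b) {
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n)
                for (std::size_t k = 0; k < K; ++k)
                    acc[m * N + n] += std::int64_t{a[m * K + k]} * b[k * N + n];
    }
};

// Usage: one mul followed by one mac, accumulating A0*B0 + A1*B1 for 2x2 tiles.
inline void ref_mmul_example() {
    const std::array<std::int16_t, 4> a0{1, 2, 3, 4};
    const std::array<std::int16_t, 4> b0{5, 6, 7, 8};
    const std::array<std::int16_t, 4> a1{1, 0, 0, 1};  // identity matrix
    const std::array<std::int16_t, 4> b1{1, 1, 1, 1};

    RefMmul<2, 2, 2> m;
    m.mul(a0, b0);  // accumulator holds A0*B0
    m.mac(a1, b1);  // accumulator holds A0*B0 + A1*B1
    assert(m.acc[0] == 20 && m.acc[1] == 23 && m.acc[2] == 44 && m.acc[3] == 51);
}
```

A model like this can serve as a golden reference when checking kernel output in simulation, provided the shift applied by the kernel when converting the accumulator to a vector is accounted for separately.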

The following sample code computes the matrix multiplication C(2x64) = A(2x8) * B(8x64) using the 2*4*8 mode of mmul. The first iteration of the loop computes C0(2x8) = A0(2x4) * B0(4x8) + A1(2x4) * B1(4x8), where A0 is the left half of A, A1 is the right half of A, B0 is the upper-left 4x8 block of B, B1 is the lower-left 4x8 block of B, and C0 is the leftmost 2x8 block of C. Each subsequent iteration moves eight columns to the right in B and C.
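The block decomposition can be checked on a host with a short scalar sketch. The code below is plain C++, not AI Engine code; direct_product and tiled_product are hypothetical helper names, and 32-bit integers stand in for the hardware accumulators. It computes C tile by tile as A0*B0 + A1*B1 and can be compared against a direct product.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t M = 2, K = 4, N = 8;            // shape of one tile product
constexpr std::size_t rowA = 2, colA = 8, colB = 64;  // full matrix sizes

using Mat = std::vector<std::int32_t>;  // row-major storage

// Direct reference: C(rowA x colB) = A(rowA x colA) * B(colA x colB)
Mat direct_product(const Mat& A, const Mat& B) {
    assert(A.size() == rowA * colA && B.size() == colA * colB);
    Mat C(rowA * colB, 0);
    for (std::size_t r = 0; r < rowA; ++r)
        for (std::size_t c = 0; c < colB; ++c)
            for (std::size_t k = 0; k < colA; ++k)
                C[r * colB + c] += A[r * colA + k] * B[k * colB + c];
    return C;
}

// Tiled version: each 2x8 block of C is the sum of two 2x4 by 4x8 products.
Mat tiled_product(const Mat& A, const Mat& B) {
    assert(A.size() == rowA * colA && B.size() == colA * colB);
    Mat C(rowA * colB, 0);
    for (std::size_t i = 0; i < colB / N; ++i)      // one 2x8 block of C per iteration
        for (std::size_t h = 0; h < colA / K; ++h)  // h = 0: A0*B0, h = 1: A1*B1
            for (std::size_t m = 0; m < M; ++m)
                for (std::size_t n = 0; n < N; ++n)
                    for (std::size_t k = 0; k < K; ++k)
                        C[m * colB + i * N + n] +=
                            A[m * colA + h * K + k] * B[(h * K + k) * colB + i * N + n];
    return C;
}
```

Splitting along K works because each element of C is a sum over eight products, and that sum can be evaluated as two partial sums of four products each, which is what the mul and mac pair accumulates.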

The data for all matrices is assumed to be stored row by row (row-major) in memory. A is read into a single vector in one access, so its left and right halves must be separated by data filtering before they are passed to mmul. B0 and B1 are read one row (eight elements) at a time, and four rows are concatenated to form each mmul operand. The two rows of C0 are not contiguous in memory, so their offsets are calculated and each row is written separately.
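The filtering step can be illustrated with a scalar model. The sketch below is plain C++ and does not call the AI Engine API; ref_filter is a hypothetical helper that assumes the selection pattern of aie::filter_even and aie::filter_odd is "keep alternating groups of step elements", starting with the first group for the even variant and the second group for the odd variant.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the selection pattern:
//   odd == false: keep groups 0, 2, 4, ... of `step` elements (filter_even)
//   odd == true : keep groups 1, 3, 5, ... of `step` elements (filter_odd)
template <std::size_t Size>
std::array<std::int16_t, Size / 2> ref_filter(const std::array<std::int16_t, Size>& v,
                                              std::size_t step, bool odd) {
    assert(step > 0 && Size % (2 * step) == 0);
    std::array<std::int16_t, Size / 2> out{};
    for (std::size_t i = 0; i < Size / 2; ++i) {
        const std::size_t group = i / step;   // which kept group this output belongs to
        const std::size_t offset = i % step;  // position inside that group
        out[i] = v[(2 * group + (odd ? 1 : 0)) * step + offset];
    }
    return out;
}
```

With A stored row-major as 2x8 and a step of 4, the even variant selects elements 0-3 and 8-11, which are columns 0-3 of both rows (A0), and the odd variant selects elements 4-7 and 12-15, which are columns 4-7 of both rows (A1).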

Note: This example shows usage of mmul. It is not targeted for performance.

#include <adf.h>
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>

//Shape of one mmul operation (M*K*N)
const int M=2;
const int K=4;
const int N=8;
//Total matrix sizes
const int rowA=2;
const int colA=8;
const int colB=64;

__attribute__((noinline)) void matrix_mul(input_window<int16>* __restrict data0, input_window<int16>* __restrict data1, output_window<int16>* __restrict out){
  constexpr size_t sizeTileA = M * K;
  constexpr size_t sizeTileB = K * N;
  constexpr size_t sizeTileC = M * N;
  //read all of A (2x8, row-major) into one 16-element vector
  aie::vector<int16,sizeTileA*2> va=window_readincr_v<sizeTileA*2>(data0);
  //select left half matrix of A into va0 (elements 0-3 and 8-11)
  aie::vector<int16,sizeTileA> va0=aie::filter_even(va,4);
  //select right half matrix of A into va1 (elements 4-7 and 12-15)
  aie::vector<int16,sizeTileA> va1=aie::filter_odd(va,4);

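  //second read pointer into B, advanced by 4 rows (4*64=256 elements)
  //so that data1 starts at B0 and data1_copy starts at B1
  input_window<int16> data1_copy_mem;
  input_window<int16>* data1_copy=&data1_copy_mem;
  window_copy(data1_copy,data1);
  window_incr(data1_copy,256);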

  aie::vector<int16,N> vb0_[4];
  aie::vector<int16,N> vb1_[4];
  aie::vector<int16,sizeTileC> vc;

  for(int i=0;i<colB/N;i++)
  chess_prepare_for_pipelining
  {
    for(int j=0;j<4;j++){
      vb0_[j]=window_read_v<8>(data1);
      window_incr(data1,64);
      vb1_[j]=window_read_v<8>(data1_copy);
      window_incr(data1_copy,64);
    }

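    aie::mmul<M,K,N,int16,int16> m;
    //C0 = A0*B0 (initial multiplication)
    m.mul(va0,aie::concat(vb0_[0],vb0_[1],vb0_[2],vb0_[3]));
    //C0 += A1*B1 (multiply-add into the same accumulator)
    m.mac(va1,aie::concat(vb1_[0],vb1_[1],vb1_[2],vb1_[3]));
    //shift the accumulator right by 15 bits and convert it to int16
    vc=m.to_vector<int16>(15);
    //write row 0 of C0, then step one row (64 elements) down to row 1
    window_write(out,vc.extract<8>(0));
    window_incr(out,64);
    window_write(out,vc.extract<8>(1));
    //64+72=136=128+8: assuming the output window holds exactly 2x64
    //elements and wraps around, this lands on the next 8 columns of row 0
    window_incr(out,72);

    //4*64+264=520=512+8: assuming the window for B holds exactly 8x64
    //elements and wraps around, this lands on the next 8 columns of B0 and B1
    window_incr(data1,264);
    window_incr(data1_copy,264);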
  }

}
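
The increments 264 and 72 in the sample only make sense if the window pointers wrap around. The following host-side sketch in plain C++ traces the element offsets under the assumption that the window for B holds exactly 8x64 = 512 elements, the output window holds exactly 2x64 = 128 elements, and both behave as circular buffers. RefWindowPos, b0_offset_at, and out_offset_at are hypothetical names and are not part of the window API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a circular window position, counted in elements.
struct RefWindowPos {
    std::size_t size;  // window capacity in elements (assumed, set by the graph)
    std::size_t pos;   // current offset from the start of the window

    explicit RefWindowPos(std::size_t n) : size(n), pos(0) { assert(n > 0); }

    // Advance by `count` elements, wrapping at the end of the window.
    void incr(std::size_t count) { pos = (pos + count) % size; }
};

// Offset of the B0 read pointer at the start of loop iteration `iter`.
std::size_t b0_offset_at(std::size_t iter) {
    RefWindowPos data1(512);  // assumed: window holds all of B (8x64)
    for (std::size_t i = 0; i < iter; ++i) {
        for (int j = 0; j < 4; ++j) data1.incr(64);  // step through four rows
        data1.incr(264);                             // wrap to the next 8 columns
    }
    return data1.pos;
}

// Offset of the output write pointer at the start of loop iteration `iter`.
std::size_t out_offset_at(std::size_t iter) {
    RefWindowPos out(128);  // assumed: window holds all of C (2x64)
    for (std::size_t i = 0; i < iter; ++i) {
        out.incr(64);  // from row 0 to row 1
        out.incr(72);  // wrap back to row 0, eight columns further
    }
    return out.pos;
}
```

Under these assumptions, both pointers advance by a net eight elements per iteration, which matches the eight-column step between consecutive blocks of B and C. If the windows are sized differently in the graph, the increments would need to be adjusted.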