The AI Engine API encapsulates the matrix multiplication functionality in the aie::mmul class template. This class template is parameterized with the matrix multiplication shape (M*K*N), the data types and, optionally, the requested accumulation precision. For the supported shapes, see Matrix Multiplication in the AI Engine API User Guide (UG1529).
It defines one function for the initial multiplication (mul) and one function for multiply-add (mac). aie::mmul objects can be initialized from vectors or accumulators so that they can be used in chained computations where partial results are sent over the cascade.
The resulting class defines a function that performs the multiplication and a data type for the result that can be converted to an accumulator/vector. The function interprets the input vectors as matrices as described by the shape parameters.
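To make the "vectors interpreted as matrices" idea concrete, the following is a minimal scalar sketch in plain C++. It does not use the AI Engine API: ref_mul and ref_mac are hypothetical helper names, the row-major layout is an assumption carried over from the example in this section, and a 64-bit integer stands in for the hardware accumulator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model (not the AI Engine API).
// Adds A(MxK) * B(KxN) to acc(MxN). All three matrices are flat arrays
// that are interpreted as row-major matrices of the given shape.
template <std::size_t M, std::size_t K, std::size_t N>
void ref_mac(std::array<std::int64_t, M * N>& acc,
             const std::array<std::int16_t, M * K>& a,
             const std::array<std::int16_t, K * N>& b) {
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = 0; k < K; ++k)
                acc[m * N + n] += std::int64_t{a[m * K + k]} * b[k * N + n];
}

// Initial multiplication: starts from a zeroed accumulator.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<std::int64_t, M * N> ref_mul(const std::array<std::int16_t, M * K>& a,
                                        const std::array<std::int16_t, K * N>& b) {
    std::array<std::int64_t, M * N> acc{};
    ref_mac<M, K, N>(acc, a, b);
    return acc;
}
```

In this model, calling ref_mul<2, 4, 8>(a, b) followed by ref_mac<2, 4, 8>(acc, a2, b2) plays the role of the mul and mac pair: the first call starts a fresh result and the second accumulates into it.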
The following sample code computes a C(2x64) = A(2x8) * B(8x64) matrix multiplication using the 2*4*8 mode of mmul. One iteration of the loop computes C0(2x8) = A0(2x4) * B0(4x8) + A1(2x4) * B1(4x8). For the first iteration, A0 is the left half of A, A1 is the right half of A, B0 is the upper-left 4x8 block of B, B1 is the lower-left 4x8 block of B, and C0 is the leftmost 2x8 block of C. Each later iteration moves eight columns to the right in B and C.
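The block decomposition can be checked with a small scalar sketch in plain C++ (no AI Engine API involved). The function names matmul_direct and matmul_tiled are hypothetical, and 32-bit integers are assumed to be wide enough for the test data. The tiled version visits the same 2x4 and 4x8 blocks that the loop in the example uses, one 8-column tile of C per outer iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes taken from the example: C(2x64) = A(2x8) * B(8x64), tiled as 2*4*8.
constexpr int M = 2, K = 4, N = 8;
constexpr int rowA = 2, colA = 8, colB = 64;

// Direct row-major product, used as the reference result.
std::vector<std::int32_t> matmul_direct(const std::vector<std::int16_t>& A,
                                        const std::vector<std::int16_t>& B) {
    std::vector<std::int32_t> C(rowA * colB, 0);
    for (int r = 0; r < rowA; ++r)
        for (int c = 0; c < colB; ++c)
            for (int k = 0; k < colA; ++k)
                C[r * colB + c] += A[r * colA + k] * B[k * colB + c];
    return C;
}

// Tiled product: for column tile i,
// C_i(2x8) = A0(2x4) * B0_i(4x8) + A1(2x4) * B1_i(4x8).
std::vector<std::int32_t> matmul_tiled(const std::vector<std::int16_t>& A,
                                       const std::vector<std::int16_t>& B) {
    assert(A.size() == static_cast<std::size_t>(rowA * colA));
    assert(B.size() == static_cast<std::size_t>(colA * colB));
    std::vector<std::int32_t> C(rowA * colB, 0);
    for (int i = 0; i < colB / N; ++i)        // 8-column tile of B and C
        for (int h = 0; h < colA / K; ++h)    // h=0: A0 and B0, h=1: A1 and B1
            for (int m = 0; m < M; ++m)
                for (int n = 0; n < N; ++n)
                    for (int k = 0; k < K; ++k)
                        C[m * colB + i * N + n] +=
                            A[m * colA + h * K + k] *
                            B[(h * K + k) * colB + i * N + n];
    return C;
}
```

Both functions should return the same values for any inputs that do not overflow 32 bits, which is what makes the two-step mul and mac per column tile equivalent to the full product.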
The data for all matrices is assumed to be stored row by row (row-major) in memory. Matrix A is read into a single vector, so its left and right halves must be separated by data filtering before they are passed to mmul. B0 and B1 are read one row (eight elements) at a time, and four rows are combined for mmul. The two rows of C0 are not contiguous in memory, so their indexes need to be calculated and the two rows are written to memory separately.
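The filtering of A can be pictured with the following scalar sketch in plain C++. filter_groups is a hypothetical stand-in written for illustration, not an AI Engine API function. It keeps alternate groups of step elements, which, for a row-major 2x8 matrix and a step of 4, separates the left 2x4 half from the right 2x4 half.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for the filtering step (not the AI Engine API).
// Splits v into consecutive groups of `step` elements and keeps every second
// group: parity 0 keeps groups 0, 2, 4, ...; parity 1 keeps groups 1, 3, 5, ...
template <std::size_t Len>
std::array<std::int16_t, Len / 2> filter_groups(
        const std::array<std::int16_t, Len>& v,
        std::size_t step, std::size_t parity) {
    static_assert(Len % 2 == 0, "input length must be even");
    assert(step > 0 && Len % (2 * step) == 0);
    assert(parity < 2);
    std::array<std::int16_t, Len / 2> out{};
    std::size_t o = 0;
    for (std::size_t i = 0; i < Len; ++i)
        if ((i / step) % 2 == parity)
            out[o++] = v[i];
    return out;
}
```

With A stored as {a00..a07, a10..a17}, filter_groups(a, 4, 0) yields {a00..a03, a10..a13}, which is the left half A0 in row-major order, and filter_groups(a, 4, 1) yields the right half A1.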
The matrix multiplication is illustrated as follows:
And the corresponding example code is as follows:
Note: This example shows the usage of mmul and is not intended for performance optimization.
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
#include "aie_api/utils.hpp"
//Shape of a single mmul operation (M*K*N)
const int M=2;
const int K=4;
const int N=8;
//Total matrix sizes
const int rowA=2;
const int colA=8;
const int colB=64;
const int SHIFT_BITS=0;
//Derived parameters: number of elements in half of B (four rows)
const int HALF_B_NUM=colA*colB/2;
using namespace adf;
using MMUL = aie::mmul<M, K, N, int16, int16>;
__attribute__((noinline)) void matmul_mmul(input_buffer<int16>& __restrict data0, input_buffer<int16>& __restrict data1, output_buffer<int16>& __restrict out){
    //Read the whole of A (2x8, 16 elements) as one vector
    auto pa=aie::begin_vector<MMUL::size_A*2>(data0);
    aie::vector<int16,MMUL::size_A*2> va=*pa;
    //select left half matrix of A into va0
    aie::vector<int16,MMUL::size_A> va0=aie::filter_even(va,4);
    //select right half matrix of A into va1
    aie::vector<int16,MMUL::size_A> va1=aie::filter_odd(va,4);
    //pb0 starts at row 0 of B (upper half)
    auto pb0=aie::begin_vector<N>(data1);
    //Note that every increment of the pointer advances N(=8) elements
    //pb1 starts at row 4 of B (lower half)
    auto pb1=pb0+HALF_B_NUM/N;
    aie::vector<int16,N> vb0_[4];
    aie::vector<int16,N> vb1_[4];
    aie::vector<int16,MMUL::size_C> vc;
    auto pc=aie::begin_vector<N>(out);
    LOOP_on_COL:for(int i=0;i<colB/N;i++)
    chess_prepare_for_pipelining
    {
        //Read four rows of B0 and four rows of B1, eight elements per row
        for(int j=0;j<4;j++){
            vb0_[j]=*pb0;
            pb0+=colB/N; //move down one row of B
            vb1_[j]=*pb1;
            pb1+=colB/N;
        }
        MMUL m;
        //C0 = A0*B0 + A1*B1
        m.mul(va0,aie::concat(vb0_[0],vb0_[1],vb0_[2],vb0_[3]));
        m.mac(va1,aie::concat(vb1_[0],vb1_[1],vb1_[2],vb1_[3]));
        vc=m.to_vector<int16>(SHIFT_BITS);//right shift SHIFT_BITS
        //Write row 0 of C0
        *pc=vc.extract<N>(0);
        //Note that every increment of the pointer advances N(=8) elements
        pc+=colB/N; //move down to row 1 of C
        //Write row 1 of C0
        *pc=vc.extract<N>(1);
        //Move back to the starting row, one vector (eight columns) to the right
        pc-=(colB/N-1);
        pb0-=(colB/N*4-1);
        pb1-=(colB/N*4-1);
    }
}