This tutorial has been built so that the user can easily change the matrix and sub-matrix sizes. Matrix A being of size (M,K) and matrix B of size (K,N), the resulting matrix C has size (M,N). The Makefile sets these values to 64 by default (sizeM, sizeK, sizeN). The sizes of the sub-matrices used by the AIE API are also defined there (subM, subK, subN). All these values can be overridden on the make command line (for example, make sizeM=32 sizeN=32).
In this part we focus on a straightforward implementation of the matrix multiply, which is selected by the macro OPTIMIZED_SOURCE = 0. The make command is invoked as make OPT=0 ..., which is the default.
# Default values for A, B, C matrix sizes
# A:MxK B:KxN C:MxN
sizeM ?= 64
sizeK ?= 64
sizeN ?= 64
# Default for A, B and C sub matrices
# 4x16x8
subM ?= 4
subK ?= 16
subN ?= 8
# Default number of iterations
NIterations ?= 16
The system_settings.h
header file defines all the sizes that will be used internally by the kernel:
// Multiply 2 matrices (MxK) x (KxN)
#define A_ROWS sizeM
#define A_COLS sizeK
#define B_ROWS A_COLS
#define B_COLS sizeN
#define C_ROWS A_ROWS
#define C_COLS B_COLS
// Non Sparse Tiling: 4x16x8
#define ATILES_ROWS_NS subM
#define ATILES_COLS_NS subK
#define BTILES_ROWS_NS ATILES_COLS_NS
#define BTILES_COLS_NS subN
#define CTILES_ROWS_NS ATILES_ROWS_NS
#define CTILES_COLS_NS BTILES_COLS_NS
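Note that the tiling patterns and the kernel loop bounds shown below assume that each matrix dimension is an exact multiple of the corresponding sub-matrix dimension. A few static_asserts, shown here only as a suggestion (they are not part of the tutorial sources), can catch an invalid override at compile time:
// Suggested sanity checks (not in the original header): the tile traversal wraps
// and the kernel loop bounds divide the matrix sizes by the sub-matrix sizes,
// so the sizes must be exact multiples of the sub-matrix sizes.
static_assert(A_ROWS % ATILES_ROWS_NS == 0, "sizeM must be a multiple of subM");
static_assert(A_COLS % ATILES_COLS_NS == 0, "sizeK must be a multiple of subK");
static_assert(B_COLS % BTILES_COLS_NS == 0, "sizeN must be a multiple of subN");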
As explained in the previous section, the matrices are transferred from DDR to the memory tile without any change, and then from the memory tile to the AI Engine-ML memory with a reordering of the data that makes them easier to read from the kernel.
Even though the write access pattern to the memory tile on the input side, as well as the read access pattern on the output side, is just linear contiguous addressing, it still needs to be specified in the graph. All these tiling parameters are defined in the file tiling_parameters.h. Let's have a look at these parameters for the input matrix A:
adf::tiling_parameters WriteAns_pattern = {
    .buffer_dimension={A_COLS,A_ROWS},
    .tiling_dimension={A_COLS,1},
    .offset={0,0},
    .tile_traversal={
        {.dimension=1, .stride=1, .wrap=A_ROWS}
    }
};

adf::tiling_parameters ReadAns_pattern = {
    .buffer_dimension={A_COLS,A_ROWS},
    .tiling_dimension={ATILES_COLS_NS,ATILES_ROWS_NS},
    .offset={0,0},
    .tile_traversal={
        {.dimension=0, .stride=ATILES_COLS_NS, .wrap=A_COLS/ATILES_COLS_NS},
        {.dimension=1, .stride=ATILES_ROWS_NS, .wrap=A_ROWS/ATILES_ROWS_NS}
    }
};
The matrix is a 2D set of data, dimension 0 being the number of columns and dimension 1 the number of rows. When writing to the memory tile, the data is stored column major in memory. The read access of matrix A is completely different: the data is read block by block, each block being a sub-matrix of the AIE API matrix multiplication, and the blocks are read column major from memory (dimension 0 then dimension 1). For matrix B it is the same, except that the blocks are read row major (dimension 1 then dimension 0). The C matrix is written block by block, column major. The following animated GIF shows the order in which the various A, B and C blocks are read from and written to memory.
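In the graph, these patterns are attached to the memory tile buffer that holds matrix A. The MatrixMultiply subgraph source is not reproduced in this section; the snippet below is only a sketch of how such patterns can be applied to a shared buffer, the buffer name mtxA being an assumption:
// Sketch only: a memory tile buffer holding A, with the patterns defined above.
// The buffer name is illustrative; the tutorial's MatrixMultiply subgraph may differ.
adf::shared_buffer<int8> mtxA = adf::shared_buffer<int8>::create({A_COLS, A_ROWS}, 1, 1);
adf::write_access(mtxA.in[0]) = adf::tiling(WriteAns_pattern);  // input side: linear contiguous addressing
adf::read_access(mtxA.out[0]) = adf::tiling(ReadAns_pattern);   // output side: read sub-matrix block by block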
The data storage at the kernel level is declared as 2D just to clarify the way it is stored, but the kernel code uses data pointers (essentially 1D data access):
std::vector<uint32> DimAin = {
    ATILES_COLS_NS*ATILES_ROWS_NS,               // Tile size
    A_ROWS*A_COLS/ATILES_COLS_NS/ATILES_ROWS_NS  // Total number of Tiles
};
std::vector<uint32> DimBin = {
    BTILES_COLS_NS*BTILES_ROWS_NS,               // Tile size
    B_ROWS*B_COLS/BTILES_COLS_NS/BTILES_ROWS_NS  // Total number of Tiles
};
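These dimension vectors are meant to be assigned to the kernel buffer ports in the graph so that the compiler knows the buffer shapes; the kernel instance name MMultKernel below is an assumption, the construct itself is the standard ADF way of setting buffer port dimensions:
// Sketch only: MMultKernel is an assumed kernel instance name inside the subgraph.
adf::dimensions(MMultKernel.in[0]) = DimAin;  // A input: {tile size, number of tiles}
adf::dimensions(MMultKernel.in[1]) = DimBin;  // B input: {tile size, number of tiles}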
The matrix multiplication kernel is very simple to write because the data has been reordered. Computing a block row of the output matrix requires reading the same block row of matrix A multiple times and the entire matrix B:
template<typename ITYPE, typename OTYPE, int SHIFT_RESULT>
void ClassicMatMult(adf::input_buffer<ITYPE,adf::extents<adf::inherited_extent,adf::inherited_extent>> & __restrict inA,
                    adf::input_buffer<ITYPE,adf::extents<adf::inherited_extent,adf::inherited_extent>> & __restrict inB,
                    adf::output_buffer<OTYPE,adf::extents<adf::inherited_extent,adf::inherited_extent>> & __restrict outC)
{
    constexpr size_t sizeTileA = ATILES_ROWS * ATILES_COLS;
    constexpr size_t sizeTileB = BTILES_ROWS * BTILES_COLS;
    constexpr size_t sizeTileC = CTILES_ROWS * CTILES_COLS;

    constexpr size_t NTilesPerRow_A = A_ROWS / ATILES_ROWS;
    constexpr size_t NTilesPerCol_A = A_COLS / ATILES_COLS;
    constexpr size_t NTilesPerRow_B = B_ROWS / BTILES_ROWS;
    constexpr size_t NTilesPerCol_B = B_COLS / BTILES_COLS;
    constexpr size_t NTilesPerRow_C = C_ROWS / CTILES_ROWS;
    constexpr size_t NTilesPerCol_C = C_COLS / CTILES_COLS;

    auto pA = aie::begin_vector<sizeTileA>(inA);
    auto pB = aie::begin_vector<sizeTileB>(inB);
    auto pC = aie::begin_vector<sizeTileC>(outC);

    aie::mmul<ATILES_ROWS, ATILES_COLS, CTILES_COLS, ITYPE, ITYPE, acc32> ctile;

    for (int i = 0; i < NTilesPerRow_C; i++)
    {
        for (int j = 0; j < NTilesPerCol_C; j++)
            chess_prepare_for_pipelining
        {
            auto a = *pA++;
            auto b = *pB++;
            ctile.mul(a, b);

            for (int k = 1; k < NTilesPerCol_A; k++)
                // chess_unroll_loop(*)
                chess_flatten_loop
            {
                a = *pA++;
                b = *pB++;
                ctile.mac(a, b);
            }
            *pC++ = ctile.template to_vector<OTYPE>(SHIFT_RESULT);
            pA -= NTilesPerCol_A; // Back to beginning of row
            // For matrix B the next tile is used
        }
        pA += NTilesPerCol_A;                  // Next row
        pB -= NTilesPerCol_B * NTilesPerRow_B; // Back to beginning of matrix B
    }
}
Pointers pA, pB and pC are declared as pointers to data chunks whose sizes match the sizes of the various sub-matrices, which makes it very simple to read the sub-matrices and to move the pointers. For each output sub-matrix, a block row of A and a block column of B are read. The blocks of an A row are contiguous in memory, as are the blocks of a B column, so moving to the next block is just a post-increment. For each new output sub-matrix, the A pointer has to be moved back to the beginning of its block row, while the B pointer simply keeps advancing. At the end of an output matrix row, the A pointer has to be moved to the beginning of the next block row of A, and the B pointer has to be reset to the beginning of matrix B.
This kernel is built for the int8 input data type and either int32 or int16 output data type. In the latter case, a simple right shift by 6 bits is performed to compensate for the accumulation over 64 products. In the graph, both versions are instantiated and placed in columns 10 and 20.
class TestMatMult: public graph {
public:
    input_plio inA1,inB1;
    output_plio outC1;
    input_plio inA2,inB2;
    output_plio outC2;

    MatrixMultiply<int8,int32,0,10> MMult1;
    MatrixMultiply<int8,int16,6,20> MMult2;

    TestMatMult(){
        inA1 = adf::input_plio::create("inputA1",adf::plio_128_bits,"data/inputA_128.txt",250);
        inB1 = adf::input_plio::create("inputB1",adf::plio_128_bits,"data/inputB_128.txt",250);
        outC1 = adf::output_plio::create("outputC1",adf::plio_128_bits,"data/outputCns_128_32b.txt",250);
        adf::connect(inA1.out[0],MMult1.inA);
        adf::connect(inB1.out[0],MMult1.inB);
        adf::connect(MMult1.outC,outC1.in[0]);

        inA2 = adf::input_plio::create("inputA2",adf::plio_128_bits,"data/inputA_128.txt",250);
        inB2 = adf::input_plio::create("inputB2",adf::plio_128_bits,"data/inputB_128.txt",250);
        outC2 = adf::output_plio::create("outputC2",adf::plio_128_bits,"data/outputCns_128_16b.txt",250);
        adf::connect(inA2.out[0],MMult2.inA);
        adf::connect(inB2.out[0],MMult2.inB);
        adf::connect(MMult2.outC,outC2.in[0]);
    };
};
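The last template parameter of MatrixMultiply carries the column used to place each instance on the AI Engine-ML array. The subgraph itself is not listed here; a typical way to honor such a parameter, assuming the subgraph holds its kernel in a member named K, is a location constraint:
// Sketch only: constrain the kernel to the column passed as template parameter.
// The kernel member name K and the row index 0 are assumptions.
adf::location<adf::kernel>(K) = adf::tile(COLUMN, 0);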