In this new version of the kernel, we want to load 2 A sub-matrices for each single B sub-matrix we load. The 2 A sub-matrices must belong to the same tile column so that they are both multiplied by the same B sub-matrix.
The simplest idea is to take 2 A tiles that sit directly one above the other and multiply them by the same B sub-matrix. On the C side, the 2 tiles that are computed will also be one above the other.
To avoid too many pointer manipulations, the A tiles will be read 2 by 2 from the Memory Tile so that they are stored right next to each other in AI Engine ML Memory. B tiles will be read as in the previous basic solutions. Similarly to A, C tiles will be stored side by side in AI Engine ML Memory and will be reorganized when they are copied back into the Memory Tile.
This approach offloads the pointer manipulation to the DMA programming, freeing some scalar processor cycles.
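To make the resulting layout concrete, here is a minimal sketch of the index arithmetic, assuming the super tiles are stored in row-major order; the helper name and the example numbers are purely illustrative and not part of any API:

#include <cstddef>

// Linear slot of A tile (r, c) in AI Engine ML Memory when tiles arrive as
// vertical pairs (super tiles). Illustrative only: the actual ordering is
// produced by the DMA tiling parameters shown below, not by this function.
constexpr std::size_t aTileIndex(std::size_t r, std::size_t c,
                                 std::size_t tilesPerRow) // tilesPerRow = A_COLS / ATILES_COLS
{
    const std::size_t superRow = r / 2; // which pair of tile rows
    const std::size_t inPair   = r % 2; // upper (0) or lower (1) tile of the pair
    return (superRow * tilesPerRow + c) * 2 + inPair;
}

// With 4 tiles per row, tile (0,1) lands in slot 2 and tile (1,1) right after
// it in slot 3, so the kernel can fetch the pair with 2 consecutive reads.
static_assert(aTileIndex(0, 1, 4) == 2 && aTileIndex(1, 1, 4) == 3, "pair is contiguous");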
The next 2 animated GIFs show how the A matrix is read from the Memory Tile and how the C matrix is written back to it. You can see that I chose super tiles consisting of 2 sub-matrices, one above the other:
These read and write orders are obtained with the following tiling parameters:
adf::tiling_parameters ReadAns_pattern = {
    .buffer_dimension = {A_COLS, A_ROWS},
    // One transfer = one super tile: 2 vertically stacked A tiles
    .tiling_dimension = {ATILES_COLS, ATILES_ROWS * 2},
    .offset = {0, 0},
    .tile_traversal = {
        {.dimension = 0, .stride = ATILES_COLS, .wrap = A_COLS / ATILES_COLS},
        {.dimension = 1, .stride = ATILES_ROWS * 2, .wrap = A_ROWS / ATILES_ROWS / 2}
    }
};

adf::tiling_parameters WriteCns_pattern = {
    .buffer_dimension = {C_COLS, C_ROWS},
    // One transfer = one super tile: 2 vertically stacked C tiles
    .tiling_dimension = {CTILES_COLS, CTILES_ROWS * 2},
    .offset = {0, 0},
    .tile_traversal = {
        {.dimension = 0, .stride = CTILES_COLS, .wrap = C_COLS / CTILES_COLS},
        {.dimension = 1, .stride = CTILES_ROWS * 2, .wrap = C_ROWS / CTILES_ROWS / 2}
    }
};
These parameters are very similar to the previous ones, except that the vertical tiling dimension and stride are doubled while the vertical wrap is halved.
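As a reminder of how such patterns are applied, here is a minimal sketch that attaches them to the Memory Tile buffer ports. The buffer names, the int16 data type, the single input/output port configuration and the ReadB_pattern name (standing for the unchanged B pattern of the previous version) are assumptions for illustration:

adf::shared_buffer<int16> bufA = adf::shared_buffer<int16>::create({A_COLS, A_ROWS}, 1, 1);
adf::shared_buffer<int16> bufB = adf::shared_buffer<int16>::create({B_COLS, B_ROWS}, 1, 1);
adf::shared_buffer<int16> bufC = adf::shared_buffer<int16>::create({C_COLS, C_ROWS}, 1, 1);

// A leaves the Memory Tile as vertical pairs of tiles; C super tiles are
// scattered back to their natural positions on the way in.
adf::read_access(bufA.out[0]) = adf::tiling(ReadAns_pattern);
adf::read_access(bufB.out[0]) = adf::tiling(ReadB_pattern); // hypothetical name: B pattern is unchanged
adf::write_access(bufC.in[0]) = adf::tiling(WriteCns_pattern);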
The C++ kernel code also changes, as we now load 2 A sub-matrices and compute 2 C sub-matrices per iteration:
template<typename ITYPE, typename OTYPE, int SHIFT_RESULT>
void ClassicMatMult(adf::input_buffer<ITYPE, adf::extents<adf::inherited_extent, adf::inherited_extent>> &__restrict inA,
                    adf::input_buffer<ITYPE, adf::extents<adf::inherited_extent, adf::inherited_extent>> &__restrict inB,
                    adf::output_buffer<OTYPE, adf::extents<adf::inherited_extent, adf::inherited_extent>> &__restrict outC)
{
    constexpr size_t sizeTileA = ATILES_ROWS * ATILES_COLS;
    constexpr size_t sizeTileB = BTILES_ROWS * BTILES_COLS;
    constexpr size_t sizeTileC = CTILES_ROWS * CTILES_COLS;

    constexpr size_t NTilesPerRow_A = A_ROWS / ATILES_ROWS;
    constexpr size_t NTilesPerCol_A = A_COLS / ATILES_COLS;
    constexpr size_t NTilesPerRow_B = B_ROWS / BTILES_ROWS;
    constexpr size_t NTilesPerCol_B = B_COLS / BTILES_COLS;
    constexpr size_t NTilesPerRow_C = C_ROWS / CTILES_ROWS;
    constexpr size_t NTilesPerCol_C = C_COLS / CTILES_COLS;

    auto pA = aie::begin_vector<sizeTileA>(inA);
    auto pB = aie::begin_vector<sizeTileB>(inB);
    auto pC = aie::begin_vector<sizeTileC>(outC);

    // Two independent accumulators, one per C tile of the super tile
    aie::mmul<ATILES_ROWS, ATILES_COLS, CTILES_COLS, ITYPE, ITYPE, acc32> ctile1;
    aie::mmul<ATILES_ROWS, ATILES_COLS, CTILES_COLS, ITYPE, ITYPE, acc32> ctile2;

    // i iterates over the C super tile rows (2 tile rows at a time)
    for (int i = 0; i < NTilesPerRow_C / 2; i++)
    {
        for (int j = 0; j < NTilesPerCol_C; j++)
            chess_prepare_for_pipelining
            chess_loop_range(4, )
            {
                // The 2 vertically adjacent A tiles are contiguous in memory
                auto a1 = *pA++;
                auto a2 = *pA++;
                auto b = *pB++;
                ctile1.mul(a1, b);
                ctile2.mul(a2, b);
                for (int k = 1; k < NTilesPerCol_A; k++)
                    chess_flatten_loop
                    {
                        a1 = *pA++;
                        a2 = *pA++;
                        b = *pB++;
                        ctile1.mac(a1, b);
                        ctile2.mac(a2, b);
                    }
                *pC++ = ctile1.template to_vector<OTYPE>(SHIFT_RESULT);
                *pC++ = ctile2.template to_vector<OTYPE>(SHIFT_RESULT);
                pA -= 2 * NTilesPerCol_A; // Back to the beginning of the A super tile row
                // For matrix B the next tile is used
            }
        pA += 2 * NTilesPerCol_A; // Move on to the next A super tile row
        pB -= NTilesPerCol_B * NTilesPerRow_B; // Back to the beginning of matrix B
    }
}
The main difference is that we now have 2 aie::mmul accumulators that are used to compute the 2 C sub-matrices.
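To close the loop, here is a minimal sketch of how this kernel could be placed in the graph, assuming int16 inputs and outputs, no output shift, the bufA/bufB/bufC buffers from the sketch above, and a purely illustrative source file name:

adf::kernel mm = adf::kernel::create(ClassicMatMult<int16, int16, 0>);
adf::source(mm) = "ClassicMatMult.cpp"; // hypothetical file name
adf::runtime<adf::ratio>(mm) = 0.9;

// Propagate concrete sizes to the inherited_extent buffer ports
adf::dimensions(mm.in[0]) = {A_COLS, A_ROWS};
adf::dimensions(mm.in[1]) = {B_COLS, B_ROWS};
adf::dimensions(mm.out[0]) = {C_COLS, C_ROWS};

// Connect the Memory Tile buffers to the kernel
adf::connect(bufA.out[0], mm.in[0]);
adf::connect(bufB.out[0], mm.in[1]);
adf::connect(mm.out[0], bufC.in[0]);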