Because aie::mmul accepts row-major vector data shaped for the matrix multiplication, the input data may need to be re-arranged to achieve good performance. This section assumes that the original data is row-major for the whole matrices, and re-arranges it to match the 4*16*8 shape used in the matrix multiplication. When the input is re-arranged into chunks that form 4*16 and 16*8 matrices and supplied to the aie::mmul API, the compiler can automatically pipeline the loop, so the multiplications proceed in parallel within the loop and the loop latency improves.
Generally, this data re-arranging is done in the PL or in the AI Engine-ML itself. However, the AI Engine-ML architecture introduces the memory tile, which significantly increases the on-chip memory inside the AI Engine-ML array. These memory tiles can be created and configured in the graph code to re-arrange the data to match the 4*16*8 shape, as shown below.
class SimpleGraph : public adf::graph {
public:
    output_plio out;
    input_plio in0, in1;

private:
    adf::kernel k;
    shared_buffer<int8> mtxA, mtxB, mtxC;

public:
    SimpleGraph() {
        out = output_plio::create("Dataout0", plio_64_bits, "data/output0.txt");
        in0 = input_plio::create("Datain0", plio_64_bits, "data/matA.txt");
        in1 = input_plio::create("Datain1", plio_64_bits, "data/matB.txt");
        k = adf::kernel::create(matrix_mul);
        mtxA = shared_buffer<int8>::create({64, 64}, 1, 1); // in elements
        mtxB = shared_buffer<int8>::create({64, 64}, 1, 1);
        mtxC = shared_buffer<int8>::create({64, 64}, 1, 1);
        adf::connect(in0.out[0], mtxA.in[0]);
        adf::connect(in1.out[0], mtxB.in[0]);
        adf::connect(mtxA.out[0], k.in[0]);
        adf::connect(mtxB.out[0], k.in[1]);
        adf::connect(k.out[0], mtxC.in[0]);
        adf::connect(mtxC.out[0], out.in[0]);
        num_buffers(mtxA) = 2;
        num_buffers(mtxB) = 2;
        num_buffers(mtxC) = 2;
        adf::source(k) = "matrix_mul.cc";
        runtime<ratio>(k) = 0.9;
        // Write 'mtxA' and 'mtxB' data to the MEM tile as single 64x64 blocks
        write_access(mtxA.in[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {64, 64},
                                           .offset = {0, 0}});
        write_access(mtxB.in[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {64, 64},
                                           .offset = {0, 0}});
        // matC re-arrange 4x8 (in elements)
        write_access(mtxC.in[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {8, 4},
                                           .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                              {.dimension = 1, .stride = 4, .wrap = 16}}});
        // matA re-arrange 4x16
        read_access(mtxA.out[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {16, 4},
                                           .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 16, .wrap = 4},
                                                              {.dimension = 1, .stride = 4, .wrap = 16}}});
        // matB re-arrange 16x8
        read_access(mtxB.out[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {8, 16},
                                           .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                              {.dimension = 1, .stride = 16, .wrap = 4}}});
        read_access(mtxC.out[0]) = tiling({.buffer_dimension = {64, 64},
                                           .tiling_dimension = {64, 64},
                                           .offset = {0, 0}});
        adf::dimensions(k.in[0]) = {4096}; // in elements
        adf::dimensions(k.in[1]) = {4096};
        adf::dimensions(k.out[0]) = {4096};
    }
};
You can see that the shared buffers mtxA, mtxB, and mtxC are configured using the tiling parameter object associated with the memory. Data transfer occurs on a per-tile basis: in this case, tiles of size (tiling_dimension) 4*16 and 16*8 are read from the buffers mtxA and mtxB, respectively. The tiling parameter of the shared buffer mtxC is configured to write the matrix multiplication output of size 4*8. This is in line with the 4*16*8 data re-arrangement.
The following figure illustrates the read access patterns of the shared
buffer mtxA.
The following figure illustrates the read access patterns of the shared
buffer mtxB.
As you can see, mtxA of size 64*64 (buffer_dimension) is broken down into chunks of 4*16 (tiling_dimension). The starting element for traversal within the buffer (offset) is {0,0}, and the traversal parameters (tile_traversal) describe how the buffer is accessed. In this case the traversal covers dimensions 0 and 1 (dimension). In dimension 0, the distance between consecutive tiles, measured in buffer elements, is 16 (stride), and the number of tiles accessed in this dimension is 4 (wrap). Similarly, in dimension 1, the stride is 4 and the wrap is 16.
A similar explanation follows for the shared buffer mtxB.
The matrix multiplication output is written to the shared buffer mtxC, as shown in the access pattern diagram below.
With the above data re-arrangement in the AI Engine-ML memory tile, the achieved loop latency is around 2303 cycles, which is roughly 262144/2303 ~= 113 int8*int8 MACs per cycle.