Because aie::mmul accepts row-based vector data shaped for the matrix multiplication, the input data may need to be re-arranged for performance. This section assumes that the original data is row based for the whole matrices and re-arranges it to match the 4x16x8 shape used in the matrix multiplication. The compiler can automatically pipeline the loop when the input is re-arranged into chunks that form 4x16 and 16x8 matrices and is supplied as input to the aie::mmul API. With this, the multiplications within the loop execute in parallel, improving the loop latency.
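The matrix multiplication kernel itself is not listed in this section. As an illustration only, a minimal sketch of such a kernel (the matrix_mul.cc file referenced by the graph below) might look like the following. It assumes the re-arranged layouts described in this section (A delivered as consecutive row-major 4x16 tiles, B as consecutive row-major 16x8 tiles, C produced as consecutive 4x8 tiles), uses the AIE API aie::mmul class, and leaves out details such as the output shift value and loop pragmas.

// matrix_mul.cc -- illustrative sketch only; assumes the tile layouts described in this section
#include <adf.h>
#include <aie_api/aie.hpp>

using MMUL = aie::mmul<4, 16, 8, int8, int8>; // M=4, K=16, N=8

void matrix_mul(adf::input_buffer<int8>& matA,   // 64x64 A, re-arranged into 4x16 tiles
                adf::input_buffer<int8>& matB,   // 64x64 B, re-arranged into 16x8 tiles
                adf::output_buffer<int8>& matC)  // 64x64 C, produced as 4x8 tiles
{
    const int8* pA = matA.data();
    const int8* pB = matB.data();
    int8*       pC = matC.data();

    constexpr unsigned rowBlks = 64 / 4;  // 16 row blocks of C
    constexpr unsigned colBlks = 64 / 8;  //  8 column blocks of C
    constexpr unsigned kBlks   = 64 / 16; //  4 accumulation blocks

    for (unsigned ir = 0; ir < rowBlks; ++ir) {
        for (unsigned ic = 0; ic < colBlks; ++ic) {
            MMUL m;
            for (unsigned k = 0; k < kBlks; ++k) {
                // Each 4x16 A tile is 64 contiguous int8; each 16x8 B tile is 128.
                aie::vector<int8, 64>  a = aie::load_v<64>(pA + (ir * kBlks + k) * 64);
                aie::vector<int8, 128> b = aie::load_v<128>(pB + (k * colBlks + ic) * 128);
                if (k == 0) m.mul(a, b);
                else        m.mac(a, b);
            }
            // One 4x8 output tile (32 int8), written in the order the mtxC write tiling expects.
            aie::store_v(pC + (ir * colBlks + ic) * 32, m.to_vector<int8>(0)); // shift of 0 is a placeholder
        }
    }
}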
Generally, this data re-arranging is done in the PL or in the AI Engine-ML kernels. However, the memory tile introduced in the AI Engine-ML architecture significantly increases the on-chip memory inside the AI Engine-ML array, and these memory tiles can be created and configured to re-arrange the data to match the 4x16x8 shape in the graph code, as shown below.
class SimpleGraph : public adf::graph {
public:
    output_plio out;
    input_plio in0, in1;

private:
    adf::kernel k;
    shared_buffer<int8> mtxA, mtxB, mtxC;

public:
    SimpleGraph() {
        out = output_plio::create("Dataout0", plio_64_bits, "data/output0.txt");
        in0 = input_plio::create("Datain0", plio_64_bits, "data/matA.txt");
        in1 = input_plio::create("Datain1", plio_64_bits, "data/matB.txt");

        k = adf::kernel::create(matrix_mul);

        mtxA = shared_buffer<int8>::create({64, 64}, 1, 1); // dimensions in elements
        mtxB = shared_buffer<int8>::create({64, 64}, 1, 1);
        mtxC = shared_buffer<int8>::create({64, 64}, 1, 1);

        adf::connect(in0.out[0], mtxA.in[0]);
        adf::connect(in1.out[0], mtxB.in[0]);
        adf::connect(mtxA.out[0], k.in[0]);
        adf::connect(mtxB.out[0], k.in[1]);
        adf::connect(k.out[0], mtxC.in[0]);
        adf::connect(mtxC.out[0], out.in[0]);

        num_buffers(mtxA) = 2;
        num_buffers(mtxB) = 2;
        num_buffers(mtxC) = 2;

        adf::source(k) = "matrix_mul.cc";
        runtime<ratio>(k) = 0.9;

        // Write 'mtxA' and 'mtxB' data to the memory tile as-is
        write_access(mtxA.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});
        write_access(mtxB.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});
        // mtxC re-arrange: write 4x8 output tiles back into the 64x64 buffer (in elements)
        write_access(mtxC.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {8, 4}, .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8}, {.dimension = 1, .stride = 4, .wrap = 16}}});
        // mtxA re-arrange: read in 4x16 tiles
        read_access(mtxA.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {16, 4}, .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 16, .wrap = 4}, {.dimension = 1, .stride = 4, .wrap = 16}}});
        // mtxB re-arrange: read in 16x8 tiles
        read_access(mtxB.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {8, 16}, .offset = {0, 0},
                                           .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8}, {.dimension = 1, .stride = 16, .wrap = 4}}});
        read_access(mtxC.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});

        adf::dimensions(k.in[0]) = {4096};  // elements
        adf::dimensions(k.in[1]) = {4096};
        adf::dimensions(k.out[0]) = {4096};
    }
};
You can see that the shared buffers mtxA, mtxB, and mtxC are configured using the tiling parameter object associated with the memory. The data transfer occurs on a tile basis: in this case, tiles of size (tiling_dimension) 4x16 and 16x8 are read from the buffers corresponding to mtxA and mtxB, respectively. The tiling parameter of the shared buffer mtxC is configured to write the matrix multiplication output in tiles of size 4x8. Together, this is in line with the data re-arranging required for the 4x16x8 shape.
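To make the re-arrangement concrete, the following host-style sketch is a reference illustration only (it is not part of the graph and assumes the first tile_traversal entry, dimension 0, is the inner loop of the traversal). It produces the same element order for mtxA that the read tiling above delivers to the kernel: the row-major 64x64 matrix broken into consecutive row-major 4x16 tiles.

#include <array>
#include <cstdint>

// Reference-only sketch: emulate the mtxA read tiling in software. The memory
// tile performs this re-arrangement in hardware; this code only illustrates
// the element order seen by the kernel at k.in[0].
std::array<int8_t, 64 * 64> rearrange_matA(const std::array<int8_t, 64 * 64>& A)
{
    std::array<int8_t, 64 * 64> out{};
    std::size_t w = 0;
    for (int rowBlk = 0; rowBlk < 16; ++rowBlk)      // dimension 1: stride 4, wrap 16
        for (int colBlk = 0; colBlk < 4; ++colBlk)   // dimension 0: stride 16, wrap 4
            for (int r = 0; r < 4; ++r)              // 4 rows in each 4x16 tile
                for (int c = 0; c < 16; ++c)         // 16 columns in each 4x16 tile
                    out[w++] = A[(rowBlk * 4 + r) * 64 + (colBlk * 16 + c)];
    return out;
}

The corresponding re-arrangement of mtxB into 16x8 tiles follows the same pattern with its own stride and wrap values.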
The following figure illustrates the read access pattern of the shared buffer mtxA.

Figure: mtxA read access pattern
The following figure illustrates the read access pattern of the shared buffer mtxB.

Figure: mtxB read access pattern
As you can see, mtxA, of size 64x64 (buffer_dimension), is broken down into chunks of 4x16 (tiling_dimension; written as {16, 4} in the graph code because dimension 0 is the faster-varying dimension). The starting element of the traversal within the buffer (offset) is {0,0}, and the traversal parameters (tile_traversal) describe how the buffer is accessed. The traversal in this case covers dimensions 0 and 1 (dimension). In dimension 0, the distance, in terms of the buffer element data type, between consecutive tiles of the traversal is 16 (stride), and the number of tiles accessed in this dimension is 4 (wrap). Similarly, in dimension 1, the stride is 4 and the wrap is 16.
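As a quick check of how stride and wrap translate into tile origins, the short sketch below (again assuming dimension 0, the first tile_traversal entry, is the inner loop) lists the starting element {dim0, dim1} of each 4x16 tile read from mtxA.

#include <cstdio>

int main()
{
    const int stride0 = 16, wrap0 = 4;  // dimension 0: across a row, 4 tiles of 16 elements
    const int stride1 = 4,  wrap1 = 16; // dimension 1: down the rows, 16 tiles of 4 elements
    for (int i1 = 0; i1 < wrap1; ++i1)      // outer loop: dimension 1
        for (int i0 = 0; i0 < wrap0; ++i0)  // inner loop: dimension 0
            std::printf("tile origin = {%d, %d}\n", i0 * stride0, i1 * stride1);
    return 0;
}

The first origins printed are {0,0}, {16,0}, {32,0}, {48,0}, then {0,4}, and so on, matching the traversal described above.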
A similar explanation applies to the shared buffer mtxB.
The matrix multiplication output is written to the shared buffer mtxC, as shown in the access pattern diagram below.

Figure: mtxC write access pattern
With the above data re-arranging performed in the AI Engine-ML memory tile, neither separate data shuffle kernels nor data shuffling inside the matrix multiplication kernel is needed.