Re-arranging Data in AI Engine-ML Memory Tile - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Because aie::mmul operates on row-based vector data arranged for a specific matrix multiplication shape, the input data might need to be re-arranged for performance. This section assumes that the original data is row based across the whole matrices and re-arranges it to match the 4*16*8 shape used in the matrix multiplication. When the input is re-arranged into chunks that form 4*16 and 16*8 matrices and is supplied to the aie::mmul API, the compiler can automatically pipeline the loop. The multiplications then occur in parallel within the loop, which improves the loop latency.
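The chunking described above can be sketched on a host machine with plain C++. This is not the aie::mmul API: the helper names (to_tiles, matmul_tiled, matmul_naive) are hypothetical, the innermost three loops only stand in for one 4*16*8 multiply-accumulate step, and int32 accumulation is an assumption made to keep the check simple. The sketch shows that once A, B, and C are stored tile by tile, each step reads and writes contiguous chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int N = 64;                      // whole matrices are N x N, row based
constexpr int M_T = 4, K_T = 16, N_T = 8;  // tile shape 4*16*8

// Reference: ordinary row-major multiply, C = A * B.
std::vector<int32_t> matmul_naive(const std::vector<int8_t>& a,
                                  const std::vector<int8_t>& b) {
    std::vector<int32_t> c(static_cast<std::size_t>(N) * N, 0);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    return c;
}

// Copy a row-major N x N matrix into tile order: tiles of rows x cols,
// visited left to right and then top to bottom, each tile stored row-major.
template <typename T>
std::vector<T> to_tiles(const std::vector<T>& m, int rows, int cols) {
    assert(m.size() == static_cast<std::size_t>(N) * N);
    std::vector<T> out;
    out.reserve(m.size());
    for (int tr = 0; tr < N; tr += rows)
        for (int tc = 0; tc < N; tc += cols)
            for (int r = 0; r < rows; ++r)
                for (int c = 0; c < cols; ++c)
                    out.push_back(m[(tr + r) * N + (tc + c)]);
    return out;
}

// Blocked multiply over tiled inputs (A in 4x16 tiles, B in 16x8 tiles).
// Returns C in 4x8 tiles. Every tile is one contiguous chunk of memory.
std::vector<int32_t> matmul_tiled(const std::vector<int8_t>& at,
                                  const std::vector<int8_t>& bt) {
    constexpr int tiles_k = N / K_T;  // 4 A tiles per tile row
    constexpr int tiles_n = N / N_T;  // 8 B (and C) tiles per tile row
    std::vector<int32_t> ct(static_cast<std::size_t>(N) * N, 0);
    for (int ti = 0; ti < N / M_T; ++ti) {
        for (int tj = 0; tj < tiles_n; ++tj) {
            int32_t* c = &ct[(ti * tiles_n + tj) * M_T * N_T];
            for (int tk = 0; tk < tiles_k; ++tk) {
                const int8_t* a = &at[(ti * tiles_k + tk) * M_T * K_T];
                const int8_t* b = &bt[(tk * tiles_n + tj) * K_T * N_T];
                // Stand-in for one 4*16*8 multiply-accumulate step.
                for (int r = 0; r < M_T; ++r)
                    for (int col = 0; col < N_T; ++col)
                        for (int k = 0; k < K_T; ++k)
                            c[r * N_T + col] += a[r * K_T + k] * b[k * N_T + col];
            }
        }
    }
    return ct;
}
```

Calling matmul_tiled(to_tiles(a, 4, 16), to_tiles(b, 16, 8)) should give the same values as to_tiles(matmul_naive(a, b), 4, 8), which is the property the memory tile configuration in this section relies on.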

Generally, this data re-arranging is done in the PL or in the AI Engine-ML. However, the AI Engine-ML architecture also introduces the AI Engine-ML memory tile, which significantly increases the on-chip memory inside the AI Engine-ML array. In the graph code below, memory tiles are created as shared buffers and configured to re-arrange the data to match the 4*16*8 shape.

#include <adf.h>

using namespace adf;

class SimpleGraph : public adf::graph {
public:
  output_plio out;
  input_plio in0, in1;
  adf::kernel k;
  shared_buffer<int8> mtxA, mtxB, mtxC;

  SimpleGraph() {
    out = output_plio::create("Dataout0", plio_64_bits, "data/output0.txt");
    in0 = input_plio::create("Datain0", plio_64_bits, "data/matA.txt");
    in1 = input_plio::create("Datain1", plio_64_bits, "data/matB.txt");

    k = adf::kernel::create(matrix_mul);

    // One 64x64 int8 buffer per matrix (dimensions in elements),
    // each with one input port and one output port.
    mtxA = shared_buffer<int8>::create({64, 64}, 1, 1);
    mtxB = shared_buffer<int8>::create({64, 64}, 1, 1);
    mtxC = shared_buffer<int8>::create({64, 64}, 1, 1);

    adf::source(k) = "";
    runtime<ratio>(k) = 0.9;

    // Write 'mtxA' data to the MEM tile as a single 64x64 tile.
    write_access(mtxA.in[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {64, 64},
                                       .offset = {0, 0}});

    // Write 'mtxB' data to the MEM tile as a single 64x64 tile.
    write_access(mtxB.in[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {64, 64},
                                       .offset = {0, 0}});

    // matC re-arrange 4x8: the output is written in 4x8 tiles (in elements).
    write_access(mtxC.in[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {8, 4},
                                       .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                          {.dimension = 1, .stride = 4, .wrap = 16}}});

    // matA re-arrange 4x16: mtxA is read in 4x16 tiles.
    read_access(mtxA.out[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {16, 4},
                                       .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 16, .wrap = 4},
                                                          {.dimension = 1, .stride = 4, .wrap = 16}}});

    // matB re-arrange 16x8: mtxB is read in 16x8 tiles.
    read_access(mtxB.out[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {8, 16},
                                       .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                          {.dimension = 1, .stride = 16, .wrap = 4}}});

    // Read 'mtxC' back from the MEM tile as a single 64x64 tile.
    read_access(mtxC.out[0]) = tiling({.buffer_dimension = {64, 64},
                                       .tiling_dimension = {64, 64},
                                       .offset = {0, 0}});
  }
};



The shared buffers mtxA, mtxB, and mtxC are configured using the tiling parameter objects associated with the memory. The data transfer occurs on a tile basis: in this case, tiles of size (tiling dimension) 4*16 and 16*8 are read from the buffers mtxA and mtxB, respectively. The tiling parameter of the shared buffer mtxC is configured to write the matrix multiplication output in tiles of size 4*8. These settings are in line with the 4*16*8 data re-arrangement.

The following figure illustrates the read access patterns of the shared buffer mtxA.

Figure 1. Read Access Data Pattern for Shared Buffer mtxA

The following figure illustrates the read access patterns of the shared buffer mtxB.

Figure 2. Read Access Data Pattern for Shared Buffer mtxB

As you can see, mtxA of size 64*64 (buffer_dimension) is broken down into chunks of 4*16 (tiling_dimension {16, 4}, that is, 16 elements along dimension 0 and 4 along dimension 1). The starting element for traversing the buffer (offset) is {0,0}, and the traversal parameters (tile_traversal) describe how the buffer is accessed from tile to tile. In this case, the traversal covers dimensions 0 and 1 (dimension). In dimension 0, the distance between consecutive tiles, measured in elements of the buffer data type, is 16 (stride), and the number of tiles to access in this dimension is 4 (wrap). Similarly, in dimension 1, the stride is 4 and the wrap is 16.

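A similar explanation follows for the shared buffer mtxB, which is read in chunks of 16*8 (tiling_dimension {8, 16}) with stride 8 and wrap 8 in dimension 0, and stride 16 and wrap 4 in dimension 1.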
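The read patterns described above can be modeled on a host machine to see which buffer elements are delivered in which order. The following is a minimal sketch, not the ADF tiling implementation. The Traversal and Tiling2D structs and the access_order function are hypothetical stand-ins. The sketch assumes that dimension 0 is the contiguous dimension in memory, that elements inside a tile are visited with dimension 0 moving fastest, and that the first tile_traversal entry is the innermost tile loop.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical host-side mirror of one tile_traversal entry.
struct Traversal {
    int dimension;  // buffer dimension this loop moves along
    int stride;     // distance between consecutive tiles, in elements
    int wrap;       // number of tiles visited along this dimension
};

// Hypothetical host-side mirror of the 2D tiling parameters in the graph.
struct Tiling2D {
    std::array<int, 2> buffer_dimension;
    std::array<int, 2> tiling_dimension;
    std::array<int, 2> offset;
    std::vector<Traversal> tile_traversal;  // assumed: entry 0 is the innermost loop
};

// Returns the linear (row-major) buffer indices in the order they are accessed.
std::vector<int> access_order(const Tiling2D& t) {
    assert(t.tile_traversal.size() == 2);
    const Traversal& inner = t.tile_traversal[0];
    const Traversal& outer = t.tile_traversal[1];
    std::vector<int> order;
    for (int o = 0; o < outer.wrap; ++o) {
        for (int i = 0; i < inner.wrap; ++i) {
            // Origin of the current tile inside the buffer.
            std::array<int, 2> origin = t.offset;
            origin[outer.dimension] += o * outer.stride;
            origin[inner.dimension] += i * inner.stride;
            // Elements inside the tile, dimension 0 fastest.
            for (int d1 = 0; d1 < t.tiling_dimension[1]; ++d1)
                for (int d0 = 0; d0 < t.tiling_dimension[0]; ++d0)
                    order.push_back((origin[1] + d1) * t.buffer_dimension[0] +
                                    origin[0] + d0);
        }
    }
    return order;
}

// Read tilings copied from the graph code for mtxA (4*16) and mtxB (16*8).
const Tiling2D mtxA_read{{64, 64}, {16, 4}, {0, 0}, {{0, 16, 4}, {1, 4, 16}}};
const Tiling2D mtxB_read{{64, 64}, {8, 16}, {0, 0}, {{0, 8, 8}, {1, 16, 4}}};
```

Under these assumptions, access_order(mtxA_read) starts with the 16 elements of row 0, columns 0 to 15, continues with the same columns of rows 1 to 3, and only then moves 16 columns to the right for the next tile.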

The matrix multiplication output is written to the shared buffer mtxC, as shown in the following access pattern diagram.

Figure 3. Matrix Multiplication Output Data Pattern in Shared Buffer mtxC
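The write side can be modeled in the same spirit. The sketch below assumes that the kernel emits its 4*8 result tiles one after another, each tile row by row, and that the write tiling for mtxC walks dimension 0 (stride 8, wrap 8) in the inner loop and dimension 1 (stride 4, wrap 16) in the outer loop. The function name write_tiles_row_major and the constant names are hypothetical, and the stream holds plain int values only to make the placement visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int N = 64;         // mtxC is N x N elements
constexpr int TILE_ROWS = 4;  // tiling_dimension[1]
constexpr int TILE_COLS = 8;  // tiling_dimension[0]

// Scatter a stream of 4x8 tiles into a row-major N x N buffer.
std::vector<int> write_tiles_row_major(const std::vector<int>& stream) {
    assert(stream.size() == static_cast<std::size_t>(N) * N);
    std::vector<int> buffer(stream.size(), 0);
    std::size_t next = 0;  // next element taken from the tile stream
    for (int row0 = 0; row0 < N; row0 += TILE_ROWS)        // dimension 1: stride 4, wrap 16
        for (int col0 = 0; col0 < N; col0 += TILE_COLS)    // dimension 0: stride 8, wrap 8
            for (int r = 0; r < TILE_ROWS; ++r)
                for (int c = 0; c < TILE_COLS; ++c)
                    buffer[(row0 + r) * N + (col0 + c)] = stream[next++];
    return buffer;
}
```

With this placement, the buffer ends up row based for the whole matrix, which is consistent with mtxC being read back as a single 64*64 tile.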

With the above data re-arranging in the AI Engine-ML memory tile, the achieved latency of the loop is around 2303 cycles. Multiplying two 64*64 matrices takes 64*64*64 = 262144 multiply-accumulate operations, so this is roughly 262144/2303 ~= 113.8 int8*int8 MACs per cycle.
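The arithmetic behind this figure can be checked with a few lines of C++. This is only a worked calculation: the constant and function names are hypothetical, and the 2303-cycle latency is the value quoted above, not something the code measures.

```cpp
#include <cassert>

// MACs needed for a 64x64 by 64x64 matrix multiplication.
constexpr long total_macs = 64L * 64 * 64;
// MACs covered by one 4*16*8 multiplication step.
constexpr long macs_per_step = 4L * 16 * 8;
// Number of 4*16*8 steps needed to cover the whole product.
constexpr long steps = total_macs / macs_per_step;

static_assert(total_macs == 262144, "64^3 multiply-accumulates");
static_assert(steps == 512, "16 x 8 output tiles, 4 steps along K each");

// Average throughput for a given loop latency in cycles.
double macs_per_cycle(long cycles) {
    assert(cycles > 0);
    return static_cast<double>(total_macs) / static_cast<double>(cycles);
}
```

For example, macs_per_cycle(2303) evaluates to a value between 113.8 and 113.9.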