Re-arranging Data in AI Engine-ML Memory Tile - 2025.1 English - UG1603

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2025-05-29
Version: 2025.1 English

Because aie::mmul accepts row-major vector data shaped for the matrix multiplication, the input data might need to be re-arranged for performance. This section assumes that the original data is row-major for the whole matrices and re-arranges it to match the 4*16*8 shape used in the matrix multiplication. When the input is re-arranged into chunks that form 4*16 and 16*8 matrices and supplied to the aie::mmul API, the compiler can automatically pipeline the loop, so the multiplications within the loop proceed in parallel and the loop latency improves.
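To make the 4*16*8 decomposition concrete, the following is a scalar reference model in standard C++ (not AIE kernel code): a 64x64x64 int8 matrix multiplication broken into 4*16 times 16*8 block products, the same shape that a single aie::mmul<4,16,8,...> call would compute with vectors.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar reference model of a 64x64x64 int8 matrix multiplication
// decomposed into 4*16 * 16*8 -> 4*8 tile operations.
constexpr int M = 64, K = 64, N = 64;   // whole-matrix dimensions
constexpr int mT = 4, kT = 16, nT = 8;  // tile shape 4*16*8

using MatA = std::array<int8_t, M * K>;
using MatB = std::array<int8_t, K * N>;
using MatC = std::array<int32_t, M * N>;

MatC tiled_matmul(const MatA& A, const MatB& B) {
    MatC C{};
    for (int i = 0; i < M; i += mT)          // 4-row blocks of A and C
        for (int j = 0; j < N; j += nT)      // 8-column blocks of B and C
            for (int k = 0; k < K; k += kT)  // one 4x16 * 16x8 block product
                for (int ii = 0; ii < mT; ++ii)
                    for (int jj = 0; jj < nT; ++jj)
                        for (int kk = 0; kk < kT; ++kk)
                            C[(i + ii) * N + (j + jj)] +=
                                int32_t(A[(i + ii) * K + (k + kk)]) *
                                int32_t(B[(k + kk) * N + (j + jj)]);
    return C;
}
```

On the AI Engine, each innermost triple loop collapses into one vectorized aie::mmul operation, which is why the data must arrive in 4*16 and 16*8 chunks.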

Generally, this data re-arranging is done in the PL or in the AI Engine. The memory tile was introduced in the AIE-ML / AIE-ML v2 architecture to significantly increase the on-chip memory inside the AI Engine array. In the graph code shown below, memory tiles are created and configured to re-arrange the data to match the 4*16*8 shape.

class SimpleGraph : public adf::graph {
public:
  output_plio out;
  input_plio in0, in1;

private:
  adf::kernel k;
  shared_buffer<int8> mtxA, mtxB, mtxC;

public:
  SimpleGraph() {
    out = output_plio::create("Dataout0", plio_64_bits, "data/output0.txt");
    in0 = input_plio::create("Datain0", plio_64_bits, "data/matA.txt");
    in1 = input_plio::create("Datain1", plio_64_bits, "data/matB.txt");

    k = adf::kernel::create(matrix_mul);

    mtxA = shared_buffer<int8>::create({64, 64}, 1, 1); // dimensions in elements
    mtxB = shared_buffer<int8>::create({64, 64}, 1, 1);
    mtxC = shared_buffer<int8>::create({64, 64}, 1, 1);

    adf::connect(in0.out[0], mtxA.in[0]);
    adf::connect(in1.out[0], mtxB.in[0]);
    adf::connect(mtxA.out[0], k.in[0]);
    adf::connect(mtxB.out[0], k.in[1]);
    adf::connect(k.out[0], mtxC.in[0]);
    adf::connect(mtxC.out[0], out.in[0]);

    num_buffers(mtxA) = 2;
    num_buffers(mtxB) = 2;
    num_buffers(mtxC) = 2;

    adf::source(k) = "matrix_mul.cc";
    runtime<ratio>(k) = 0.9;

    // Write 'mtxA' data to the memory tile
    write_access(mtxA.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});

    // Write 'mtxB' data to the memory tile
    write_access(mtxB.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});

    // matC re-arrange: 4x8 output tiles (dimensions in elements)
    write_access(mtxC.in[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {8, 4}, .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                          {.dimension = 1, .stride = 4, .wrap = 16}}});

    // matA re-arrange: 4x16 input tiles
    read_access(mtxA.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {16, 4}, .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 16, .wrap = 4},
                                                          {.dimension = 1, .stride = 4, .wrap = 16}}});

    // matB re-arrange: 16x8 input tiles
    read_access(mtxB.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {8, 16}, .offset = {0, 0},
                                       .tile_traversal = {{.dimension = 0, .stride = 8, .wrap = 8},
                                                          {.dimension = 1, .stride = 16, .wrap = 4}}});

    read_access(mtxC.out[0]) = tiling({.buffer_dimension = {64, 64}, .tiling_dimension = {64, 64}, .offset = {0, 0}});

    adf::dimensions(k.in[0]) = {4096}; // in elements
    adf::dimensions(k.in[1]) = {4096};
    adf::dimensions(k.out[0]) = {4096};
  }
};

You can see that the shared buffers mtxA, mtxB, and mtxC are configured using the tiling parameter object associated with the memory. The data transfer occurs on a tile basis: tiles of size (tiling_dimension) 4*16 and 16*8 are read from the buffers of mtxA and mtxB, respectively. The tiling parameter of the shared buffer mtxC is configured to write the 4*8 matrix multiplication output. Together, these settings achieve the 4*16*8 data re-arrangement.

The following figure illustrates the read access patterns of the shared buffer mtxA.

Figure 1. Read Access Data Pattern for Shared Buffer mtxA

The following figure illustrates the read access patterns of the shared buffer mtxB.

Figure 2. Read Access Data Pattern for Shared Buffer mtxB

As you can see, mtxA of size 64*64 (buffer_dimension) is broken down into chunks of 4*16 (tiling_dimension). The starting element for traversing the buffer (offset) is {0,0}, and the traversal parameters describe how the buffer is accessed (tile_traversal). Traversal occurs along dimensions 0 and 1 (dimension). In dimension 0, the distance between consecutive tiles, in units of the buffer element data type, is 16 (stride), and the number of tiles accessed in this dimension is 4 (wrap). Similarly, in dimension 1 the stride is 4 and the wrap is 16.
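The stride/wrap traversal above can be modeled in plain C++ to show the exact element order the DMA produces for mtxA. This sketch assumes that dimension 0 is the fastest-varying address dimension of the 64*64 buffer and that the first listed tile_traversal entry (dimension 0) iterates innermost; the parameters mirror the graph code.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Model of the mtxA read pattern: buffer_dimension={64,64},
// tiling_dimension={16,4}, dim0 stride=16/wrap=4, dim1 stride=4/wrap=16.
// Returns linear element addresses in the order they are read.
std::vector<std::size_t> mtxA_read_order() {
    const std::size_t D0 = 64;  // buffer size along dimension 0
    std::vector<std::size_t> order;
    for (std::size_t i1 = 0; i1 < 16; ++i1)            // dim1: wrap=16, stride=4
        for (std::size_t i0 = 0; i0 < 4; ++i0)         // dim0: wrap=4, stride=16
            for (std::size_t e1 = 0; e1 < 4; ++e1)     // rows inside one 4x16 tile
                for (std::size_t e0 = 0; e0 < 16; ++e0)  // elements in a tile row
                    order.push_back((i0 * 16 + e0) + (i1 * 4 + e1) * D0);
    return order;
}
```

Under these assumptions, the first 16 addresses are 0..15 (one tile row), the second tile starts 16 elements over, and after the dimension-0 wrap of 4 tiles the traversal moves 4 rows down in dimension 1.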

A similar explanation follows for the shared buffer mtxB.

The matrix multiplication output is written to the shared buffer mtxC, as shown in the below access pattern diagram.

Figure 3. Matrix Multiplication Output Data Pattern in Shared Buffer mtxC

With the above data re-arranging in the memory tile, no separate data-shuffle kernels, and no data shuffling inside the matrix multiplication kernel, are needed.
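The following standard C++ sketch illustrates that point: once A arrives as 4*16 tiles and B as 16*8 tiles (in the traversal orders configured above), the kernel only multiplies consecutive tile pairs and emits 4*8 output tiles, with no shuffle code. This is a scalar stand-in for the vectorized aie::mmul kernel; the helper and function names are illustrative, not from the tutorial source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int M = 64, K = 64, N = 64;
constexpr int mT = 4, kT = 16, nT = 8;

// A tiles arrive i-block outer, k-block inner (mirroring the mtxA traversal);
// each 4x16 tile is stored row-major.
std::vector<int8_t> tile_order_A(const std::vector<int8_t>& A) {
    std::vector<int8_t> out;
    for (int i = 0; i < M; i += mT)
        for (int k = 0; k < K; k += kT)
            for (int ii = 0; ii < mT; ++ii)
                for (int kk = 0; kk < kT; ++kk)
                    out.push_back(A[(i + ii) * K + k + kk]);
    return out;
}

// B tiles arrive k-block outer, j-block inner (mirroring the mtxB traversal).
std::vector<int8_t> tile_order_B(const std::vector<int8_t>& B) {
    std::vector<int8_t> out;
    for (int k = 0; k < K; k += kT)
        for (int j = 0; j < N; j += nT)
            for (int kk = 0; kk < kT; ++kk)
                for (int jj = 0; jj < nT; ++jj)
                    out.push_back(B[(k + kk) * N + j + jj]);
    return out;
}

// Kernel model: consumes tile-ordered inputs and emits C as 4x8 tiles in
// i-block outer, j-block inner order (mirroring the mtxC write traversal).
std::vector<int32_t> matmul_tiled(const std::vector<int8_t>& At,
                                  const std::vector<int8_t>& Bt) {
    std::vector<int32_t> Ct;
    const int kBlocks = K / kT, jBlocks = N / nT;
    for (int ib = 0; ib < M / mT; ++ib)
        for (int jb = 0; jb < jBlocks; ++jb) {
            int32_t acc[mT * nT] = {};                 // one 4x8 accumulator
            for (int kb = 0; kb < kBlocks; ++kb) {
                const int8_t* a = &At[(ib * kBlocks + kb) * mT * kT];
                const int8_t* b = &Bt[(kb * jBlocks + jb) * kT * nT];
                for (int ii = 0; ii < mT; ++ii)        // one 4x16 * 16x8 product
                    for (int jj = 0; jj < nT; ++jj)
                        for (int kk = 0; kk < kT; ++kk)
                            acc[ii * nT + jj] +=
                                int32_t(a[ii * kT + kk]) * int32_t(b[kk * nT + jj]);
            }
            for (int v : acc) Ct.push_back(v);
        }
    return Ct;
}
```

Each inner tile product reads its operands from contiguous memory, which is exactly what the memory-tile DMA arranges for the real kernel.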

Note: The block floating-point data types are specific in that 16 values are encoded in 18 bytes (mx9), 12 bytes (mx6), or 8 bytes (mx4). Because addresses must be aligned on 32 bits (4 bytes), 16 values at a time can be transferred for mx6 and mx4. An mx9 block, however, is encoded in 18 bytes, so only an even number of mx9 blocks can be manipulated by the memory tile DMA and by all other DMAs (memory module and interface tile).
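The alignment arithmetic behind that note can be checked in a few lines; the byte sizes below are the per-block figures stated above, and the 4-byte granularity reflects the 32-bit address alignment requirement.

```cpp
#include <cassert>

// Bytes per block of 16 values for each block floating-point format.
constexpr int mx9_bytes = 18, mx6_bytes = 12, mx4_bytes = 8;

// A DMA transfer unit must be a multiple of 4 bytes (32-bit alignment).
constexpr bool dma_alignable(int bytes) { return bytes % 4 == 0; }
```

One mx9 block (18 bytes) is not a multiple of 4, but two blocks (36 bytes) are, which is why mx9 data must be handled in pairs of blocks.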

This Vitis tutorial implements such a matrix multiplication with re-arranged data. In the tutorial, a 64x64x64 matrix multiplication is implemented with 16-bit and 32-bit outputs using either a basic block-by-block approach or an optimized approach that combines some of the data reads.

In terms of function duration in cycles and vector operations per cycle, the results are as follows:

Table 1. Number of Cycles per Function Run

                     32-bit output   16-bit output
Basic Approach           2344            2075
Optimized Reading        1380            1122

Table 2. Number of MAC Operations per Cycle

                     32-bit output   16-bit output
Basic Approach            111             126
Optimized Reading         190             234
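Table 2 follows directly from Table 1: a 64x64x64 matrix multiplication performs 64^3 = 262,144 multiply-accumulate operations, so MACs per cycle is that count divided by the cycle counts above (the table rounds to integers). A quick check:

```cpp
#include <cassert>
#include <cmath>

// Total multiply-accumulate operations in a 64x64x64 matrix multiplication.
constexpr long kMacs = 64L * 64 * 64;  // 262,144

double macs_per_cycle(long cycles) { return double(kMacs) / double(cycles); }
```

For example, 262,144 / 1,380 cycles is approximately 190 MACs per cycle, matching the optimized 32-bit row of Table 2.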