Multi-Level AI Engine-ML Data Access

AI Engine-ML Kernel and Graph Programming Guide (UG1603), 2024.2 English, 2024-11-28

As described above, tiling parameterization can be associated with each port that uses a DMA. If the access scheme is linear over the whole buffer, the tiling specification is optional. Tiling parameterization on both ends of a connection is subject to a single constraint: the overall data transfer volume must be the same on both ends.

In the following example, tiling parameterization is required at the following levels:

  • External Memory (DDR) (known as external_buffer in the graph)
  • AI Engine-ML Memory Tile (MEM Tile) (known as shared_buffer in the graph)
Figure 1. Multi-Memory Level System

In this example, the full data set resides in the DDR memory and must be split in three to fit in a MEM Tile. The AI Engine-ML core can process a quarter of that data per iteration; hence, the MEM Tile data is split in four, with each section sent one by one into the AI Engine-ML memory. The processed data is sent to another MEM Tile, this time divided in two, and forwarded to another AI Engine memory for processing by a second kernel. The processed data then goes through a third MEM Tile and back to the DDR. Here is a graph view of this application:
Figure 2. Multi Level Memory System Graph View

The application is parameterized with the following values:

NITERATIONS
Number of iterations handled by the main host code.
NPARTS
The global data set in the DDR is divided into NPARTS sections which are sent to the MEM Tiles.
NFRAMES
Each section is divided into NFRAMES sub-sections which are sent to the AI Engine memory.
NVECTORS
Each sub-section is split into NVECTORS vectors which are handled by the kernel in a single run.
VECTOR_LENGTH
Size of the basic vector processed by the kernels.
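The actual numeric values are not listed in this section. Purely for illustration, a hypothetical set of values matching the splits described above (three DDR sections, four frames for K1, two frames for K2) could look as follows; all numbers are assumptions, not part of the design files:

// Hypothetical example values (for illustration only)
#define NITERATIONS      4
#define NPARTS           3   // DDR data set split in three MEM Tile sections
#define NFRAMES_1        4   // each section split in four frames for kernel K1
#define NVECTORS_1       8
#define VECTOR_LENGTH_1  16
#define NFRAMES_2        2   // each section split in two frames for kernel K2
#define NVECTORS_2       16
#define VECTOR_LENGTH_2  16

With these assumed values, each section moves NFRAMES_1*NVECTORS_1*VECTOR_LENGTH_1 = 4*8*16 = 512 samples into the first MEM Tile and NFRAMES_2*NVECTORS_2*VECTOR_LENGTH_2 = 2*16*16 = 512 samples out of the second one, so the data volume matches on both ends of every connection.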

For this application, NFRAMES, NVECTORS, and VECTOR_LENGTH are specific to each kernel (K1, K2). The file tiling_parameters.h groups all tiling parameters:

// Complete Dataset sizes
const uint32_t totalSize1 = NITERATIONS*NPARTS*NFRAMES_1*NVECTORS_1*VECTOR_LENGTH_1;
const uint32_t totalSize2 = NITERATIONS*NPARTS*NFRAMES_2*NVECTORS_2*VECTOR_LENGTH_2;

// Input and Output DDR size for all dimensions
const std::vector<uint32_t> ddr_size1 = { VECTOR_LENGTH_1*NVECTORS_1, NFRAMES_1, NPARTS};
const std::vector<uint32_t> ddr_size2 = { VECTOR_LENGTH_2*NVECTORS_2, NFRAMES_2, NPARTS};

// MEM Tile data size
const std::vector<uint32_t> shared_mem_size1 = {VECTOR_LENGTH_1,NVECTORS_1, NFRAMES_1};
const std::vector<uint32_t> shared_mem_size2 = {VECTOR_LENGTH_2,NVECTORS_2, NFRAMES_2};

// AI Engine-ML buffer size
const std::vector<uint32_t> buffer_size1 = {VECTOR_LENGTH_1,NVECTORS_1};
const std::vector<uint32_t> buffer_size2 = {VECTOR_LENGTH_2,NVECTORS_2};

// Parameters given to the kernels
const int LoopSize_1 = VECTOR_LENGTH_1*NVECTORS_1;
const int LoopSize_2 = VECTOR_LENGTH_2*NVECTORS_2;

// Number of times a kernel should be run for each iteration
const uint32_t bufferRepetition_1 = NFRAMES_1*NPARTS;
const uint32_t bufferRepetition_2 = NFRAMES_2*NPARTS;


// Tiling Parameter for the input and output DDR
adf::tiling_parameters DDR_pattern1 = {
.buffer_dimension=ddr_size1,
.tiling_dimension={VECTOR_LENGTH_1*NVECTORS_1, NFRAMES_1,1},
.offset={0,0,0},
.tile_traversal={{.dimension=2, .stride=1, .wrap=NPARTS}}
};

adf::tiling_parameters DDR_pattern2 = {
.buffer_dimension=ddr_size2,
.tiling_dimension={VECTOR_LENGTH_2*NVECTORS_2, NFRAMES_2,1},
.offset={0,0,0},
.tile_traversal={{.dimension=2, .stride=1, .wrap=NPARTS}}
};

// Tiling Parameter for MEM Tiles

adf::tiling_parameters MEM_pattern1 = {
.buffer_dimension=shared_mem_size1,
.tiling_dimension={VECTOR_LENGTH_1,NVECTORS_1, 1},
.offset={0,0,0},
.tile_traversal={{.dimension=2, .stride=1, .wrap=NFRAMES_1}}
};

adf::tiling_parameters MEM_pattern2 = {
.buffer_dimension=shared_mem_size2,
.tiling_dimension={VECTOR_LENGTH_2,NVECTORS_2, 1},
.offset={0,0,0},
.tile_traversal={{.dimension=2, .stride=1, .wrap=NFRAMES_2}}
};
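Because the overall data transfer volume must be identical on both ends of every connection, the per-section volume written by K1 must equal the per-section volume read by K2. The original file does not contain such checks, but as a sketch, a couple of compile-time assertions could be added to tiling_parameters.h:

// Optional compile-time sanity checks (sketch, not part of the original file)
static_assert(NFRAMES_1*NVECTORS_1*VECTOR_LENGTH_1 == NFRAMES_2*NVECTORS_2*VECTOR_LENGTH_2,
              "Per-section data volume must match between kernel K1 and kernel K2");
static_assert(totalSize1 == totalSize2,
              "Total DDR input and output data volumes must match");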

The graph construction uses all these tiling parameters to configure the input and output DMAs. When defining the kernels K1 and K2, the number of times each kernel runs per iteration must be specified:

class transfer_control : public graph {
public:
kernel K1,K2;
external_buffer<uint32> ddrin,ddrout;
shared_buffer<uint32> mtx1,mtx2,mtx3;

transfer_control() {

	// kernels
	K1 = kernel::create_object<PassThrough>(1,LoopSize_1);
	source(K1) = "src/passthrough.cpp";
	headers(K1) = {"src/passthrough.h"};
	runtime<ratio>(K1) = 0.9;
	repetition_count(K1) = bufferRepetition_1;
	
	K2 = kernel::create_object<PassThrough>(2,LoopSize_2);
	source(K2) = "src/passthrough.cpp";
	headers(K2) = {"src/passthrough.h"};
	runtime<ratio>(K2) = 0.9;
	repetition_count(K2) = bufferRepetition_2;
	
	…
	};
};
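The kernel sources src/passthrough.cpp and src/passthrough.h are not shown in this section. As a minimal sketch, assuming the kernel simply copies LoopSize samples from its input buffer to its output buffer, the C++ kernel class could look like the following (member names and the body are assumptions):

// src/passthrough.h -- minimal sketch (assumed, not the original source)
#pragma once
#include <adf.h>
#include <aie_api/aie.hpp>

class PassThrough {
    int m_id;        // kernel identifier (1 or 2)
    int m_loopSize;  // samples copied per run (LoopSize_1 or LoopSize_2)
public:
    PassThrough(int id, int loopSize) : m_id(id), m_loopSize(loopSize) {}

    // Copy m_loopSize samples from the input buffer to the output buffer
    void run(adf::input_buffer<uint32> &in, adf::output_buffer<uint32> &out)
    {
        auto pin  = aie::begin(in);
        auto pout = aie::begin(out);
        for (int i = 0; i < m_loopSize; i++)
            *pout++ = *pin++;
    }

    static void registerKernelClass()
    {
        REGISTER_FUNCTION(PassThrough::run);
    }
};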

The external buffers and shared buffers are then created. The repetition_count method is used again to define the number of frame reads/writes per iteration:

// External Buffers (in External Memory)
// Size, number of input ports, number of output ports
ddrin = external_buffer<uint32>::create(ddr_size1, 0, 1);
ddrout = external_buffer<uint32>::create(ddr_size2, 1, 0);

// Shared Buffers (in Memory Tiles)
// Size, number of input ports, number of output ports
mtx1 = shared_buffer<uint32_t>::create(shared_mem_size1,1,1);
repetition_count(mtx1) = NPARTS;

mtx2 = shared_buffer<uint32_t>::create(shared_mem_size2,1,1);
repetition_count(mtx2) = NPARTS;

mtx3 = shared_buffer<uint32_t>::create(shared_mem_size2,1,1);
repetition_count(mtx3) = NPARTS;

// Shared buffers support ping-pong buffering
num_buffers(mtx1) = 2;
num_buffers(mtx2) = 2;
num_buffers(mtx3) = 2;

The AI Engine-ML memory buffers are created automatically when the kernels are connected to the other elements:

// Connect Input DDR to Input MEM Tile
connect(ddrin.out[0], mtx1.in[0]);

// Specify read access pattern for the data source and the write access pattern for the destination for the DDR --> MEM Tile connection
read_access(ddrin.out[0]) = DDR_pattern1;
write_access(mtx1.in[0]) = MEM_pattern1;

// Kernel 1 connection
connect(mtx1.out[0], K1.in[0]);
// Access pattern can be defined only on the Shared buffer (Memory Tile) within the graph
read_access(mtx1.out[0]) = MEM_pattern1;
dimensions(K1.in[0]) = buffer_size1;

connect(K1.out[0], mtx2.in[0]);
dimensions(K1.out[0]) = buffer_size1;
// Access pattern can be defined only on the Shared buffer (Memory Tile) within the graph
write_access(mtx2.in[0]) = MEM_pattern1;

// Kernel 2 connection
connect(mtx2.out[0], K2.in[0]);
// Access pattern can be defined only on the Shared buffer (Memory Tile) within the graph
read_access(mtx2.out[0]) = MEM_pattern2;
dimensions(K2.in[0]) = buffer_size2;

connect(K2.out[0], mtx3.in[0]);
dimensions(K2.out[0]) = buffer_size2;
// Access pattern can be defined only on the Shared buffer (Memory Tile) within the graph
write_access(mtx3.in[0]) = MEM_pattern2;

// Connect Output MEM Tile to Output DDR
connect(mtx3.out[0], ddrout.in[0]);
// Specify the read access pattern for the data source and the write access pattern for the destination for the MEM Tile --> DDR connection
read_access(mtx3.out[0]) = MEM_pattern2;
write_access(ddrout.in[0]) = DDR_pattern2;

In addition to the tiling parameters for all inputs and outputs of the shared buffers and external buffers, the buffer sizes of the kernels are also specified within the graph.
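Finally, the application drives NITERATIONS graph iterations. The corresponding code is not shown in this section; a minimal sketch of the AI Engine application side could look as follows, assuming the transfer_control graph class is declared in a header such as graph.h (name assumed) and that filling and draining the external buffers is handled by the host application outside this snippet:

// Graph driver sketch (assumed, not part of the original example)
#include "graph.h"               // assumed header declaring class transfer_control
#include "tiling_parameters.h"

transfer_control gr;

int main(void)
{
    gr.init();
    gr.run(NITERATIONS);         // each graph iteration processes NPARTS sections
    gr.end();
    return 0;
}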