AI Engine-ML Local Memory Access - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)


The buffers used in the various kernels are inferred automatically at the graph level as ping-pong buffers located in the AI Engine-ML data memory. From the kernel's point of view, the buffers are accessed through the address generator units of the processor and are managed by the kernel.

Input/output buffers can be filled and flushed by other parts of the device through streams handled internally by a DMA. To simplify kernel addressing, you might want the data stored in a specific order in the data memory, for example a transposed tensor that simplifies matrix multiplication at the kernel level. You cannot currently define specific access patterns for this kind of memory: no tiling parameters can be defined, so local memories use simple linear addressing. If a specific ordering is needed in the local memory, the transmitter must take it into account. Similarly, when the local memory is flushed, the destination must take into account the specific data ordering of the local memory.

The following code snippet shows the graph-level API for buffers in the AI Engine-ML data memory:
class MatrixMultiply : public graph {
public:
	input_port inA, inB;
	output_port outC;

	kernel MatMult;

	MatrixMultiply() {

		// kernel
		MatMult = kernel::create(ClassicMatMult);
		source(MatMult) = "src/matmult.cpp";
		runtime<ratio>(MatMult) = 0.9;

		// Connect Input A to MatMult Kernel
		connect(inA, MatMult.in[0]);
		dimensions(MatMult.in[0]) = {NColsA,NRowsA};
		// Connect Input B to MatMult Kernel
		connect(inB, MatMult.in[1]);
		dimensions(MatMult.in[1]) = {NColsB,NRowsB};
		// Connect MatMult Kernel to Output
		connect(MatMult.out[0], outC);
		dimensions(MatMult.out[0]) = {NColsC,NRowsC};
	}
};

The AI Engine compiler reads the connections in the constructor to infer the right number of buffers at the input and output ports of the kernel. The size of these buffers is parameterized by:
dimensions(MatMult.in[0]) = {NColsA,NRowsA};

This size definition can also be done directly inside the kernel.
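As a sketch of the kernel-side alternative, the buffer extents can be declared directly in the kernel signature through the ADF buffer types. The exact template arguments below follow the adf::extents pattern and are an assumption to be checked against the ADF API reference; a flat (linear) extent is shown, consistent with the linear addressing of local memories described above.

```cpp
// Hypothetical kernel signature with sizes declared in the buffer types
// (verify the adf::extents arguments against the ADF API reference).
void ClassicMatMult(adf::input_buffer<int32, adf::extents<NColsA * NRowsA>> &inA,
                    adf::input_buffer<int32, adf::extents<NColsB * NRowsB>> &inB,
                    adf::output_buffer<int32, adf::extents<NColsC * NRowsC>> &outC);
```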