Synchronous Buffer Port Access - 2023.2 English

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2023-12-04
Version: 2023.2 English

A kernel reads from its input buffers and writes to its output buffers. By default, the synchronization that is required to wait for an input buffer of data is performed before entering the kernel. The synchronization that is required to provide an empty output buffer is also performed before entering the kernel. Once the kernel has started execution, it needs no further synchronization to read or write individual samples of data through its synchronous buffer ports.

The buffer port size can be declared either with the dimensions() API or in the kernel function prototype.

Option 1
Configure with the dimensions() API in the graph.
connect netN(in.out[0], k.in[0]);
dimensions(k.in[0])={INPUT_SAMPLE_SIZE};
Option 2
Configure with the kernel function prototype. Function prototypes are declared in kernel header files, which are referenced in the graph code.

Graph code specifies connection.

connect netN(in.out[0], k.in[0]);

Kernel code specifies data type and buffer size.

void simple(input_buffer<int32, adf::extents<INPUT_SAMPLE_SIZE>> & in,
          output_buffer<int32, adf::extents<OUTPUT_SAMPLE_SIZE>> & out);
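As a rough aid for choosing a buffer size, the following sketch estimates the memory one buffer port occupies. It assumes the dimension counts samples rather than bytes (as the name INPUT_SAMPLE_SIZE suggests) and that a ping-pong pair, as described in this section, holds two full copies. The value 256 and the helper names bufferBytes and pingPongBytes are illustrative assumptions, not part of the ADF API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative value only; the guide does not fix INPUT_SAMPLE_SIZE.
constexpr std::size_t INPUT_SAMPLE_SIZE = 256;

// Hypothetical helper: bytes needed for one buffer of `samples` elements of T.
template <typename T>
constexpr std::size_t bufferBytes(std::size_t samples) {
    return samples * sizeof(T);
}

// Hypothetical helper: a ping-pong pair is assumed to hold two full buffers.
template <typename T>
constexpr std::size_t pingPongBytes(std::size_t samples) {
    return 2 * bufferBytes<T>(samples);
}

// 256 samples of a 4-byte integer occupy 1024 bytes per buffer.
static_assert(bufferBytes<std::int32_t>(INPUT_SAMPLE_SIZE) == 1024,
              "one int32 buffer of 256 samples");
static_assert(pingPongBytes<std::int32_t>(INPUT_SAMPLE_SIZE) == 2048,
              "ping plus pong");
```

Under these assumptions, an int32 port with 256 samples needs 1024 bytes for a single buffer and 2048 bytes for a ping-pong pair.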
In the following example, a kernel located in tile 1 uses a ping-pong buffer for writing, and a kernel located in the adjacent tile 2 uses the same ping-pong buffer for reading. The two kernels and two main functions do not have the same execution time, which leads to some processor stalls at run time. Overall, kernel 1 writes to the ping buffer while kernel 2 reads from the pong buffer. In the next iteration, kernel 1 writes to the pong buffer and kernel 2 reads from the ping buffer.
Figure 1. Lock Mechanism For Synchronous Ping-pong Buffer Access

The kernel's buffer lock mechanism is handled in the tile's main function. The kernel starts only when all input buffers have been locked for reading and all output buffers have been locked for writing. The minimum latency for a lock acquisition is seven clock cycles if the buffer is ready to be acquired. If the buffer is already locked by another kernel, the acquiring kernel stalls until the buffer becomes available (indicated in red in the figure).
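The acquire-then-run ordering described above can be pictured with a small host-side model. This is only a sketch in standard C++, not AI Engine code: BufferLock, Buffer, simpleKernel, and runIteration are hypothetical names, and a return value of false stands in for a hardware stall.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the hardware lock that guards one buffer.
struct BufferLock {
    bool held = false;
};

constexpr std::size_t kSamples = 8;  // illustrative buffer size
using Buffer = std::array<std::int32_t, kSamples>;

// Kernel body: plain reads and writes, with no synchronization inside.
void simpleKernel(const Buffer& in, Buffer& out) {
    for (std::size_t i = 0; i < kSamples; ++i) {
        out[i] = in[i] + 1;
    }
}

// Models one pass of the tile's main function. Returns false (a "stall")
// unless both the input and the output buffer can be locked.
bool runIteration(BufferLock& inLock, BufferLock& outLock,
                  const Buffer& in, Buffer& out) {
    if (inLock.held || outLock.held) {
        return false;               // some buffer is owned by another kernel
    }
    inLock.held = true;             // acquire input for reading
    outLock.held = true;            // acquire output for writing
    simpleKernel(in, out);          // kernel starts only after both locks
    assert(inLock.held && outLock.held);
    outLock.held = false;           // release so the consumer can read
    inLock.held = false;            // release so the producer can refill
    return true;
}
```

In this model, setting outLock.held to true before the call (as if a neighboring kernel still owned the output buffer) makes runIteration return false without touching either buffer.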

The diagram shows that lock acquisition alternates between the ping buffer and the pong buffer. The ping or pong buffer is selected automatically; no user decision is needed.
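To get a feel for where the stalls come from, the following standard C++ sketch models the ping-pong alternation as a simple timeline. It is an assumption-laden model, not a cycle-accurate simulation: every acquisition is charged the seven-cycle minimum, a blocked kernel waits until the other kernel releases the buffer, and the names PingPongReport and simulatePingPong are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Minimum lock-acquisition latency quoted in the text.
constexpr std::int64_t kLockCycles = 7;

// Hypothetical result type for this model.
struct PingPongReport {
    std::int64_t writerStall = 0;  // cycles kernel 1 waited for an empty buffer
    std::int64_t readerStall = 0;  // cycles kernel 2 waited for a full buffer
    std::vector<int> bufferOrder;  // 0 = ping, 1 = pong, one entry per iteration
};

PingPongReport simulatePingPong(int iterations,
                                std::int64_t writeCycles,
                                std::int64_t readCycles) {
    assert(iterations >= 0);
    std::int64_t emptyAt[2] = {0, 0};  // when the reader last released each buffer
    std::int64_t fullAt[2] = {0, 0};   // when the writer last released each buffer
    std::int64_t writerReady = 0;      // when kernel 1 requests its next lock
    std::int64_t readerReady = 0;      // when kernel 2 requests its next lock
    PingPongReport report;

    for (int i = 0; i < iterations; ++i) {
        const int b = i % 2;  // automatic alternation: ping, pong, ping, ...

        // Kernel 1 needs buffer b to be empty before it can write.
        const std::int64_t wStall =
            std::max<std::int64_t>(0, emptyAt[b] - writerReady);
        writerReady += wStall + kLockCycles + writeCycles;
        fullAt[b] = writerReady;

        // Kernel 2 needs buffer b to be full before it can read.
        const std::int64_t rStall =
            std::max<std::int64_t>(0, fullAt[b] - readerReady);
        readerReady += rStall + kLockCycles + readCycles;
        emptyAt[b] = readerReady;

        report.writerStall += wStall;
        report.readerStall += rStall;
        report.bufferOrder.push_back(b);
    }
    return report;
}
```

For example, simulatePingPong(4, 100, 150) models a reader that is slower than the writer. In this model the writer accumulates 100 stall cycles waiting for buffers to be emptied, while the reader only stalls for the initial 107 cycles needed to fill the first buffer.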