If a kernel does not consume a full buffer of data on every invocation, or does not produce a full buffer of data on every invocation, you can control buffer synchronization yourself by declaring the port as an asynchronous buffer (adf::input_async_buffer or adf::output_async_buffer) in the kernel function prototype. In the example below, the kernel simple uses:
- ifm: Synchronous input buffer port.
- wts: Asynchronous input buffer port.
- ofm: Asynchronous output buffer port.
The declaration below tells the compiler to omit synchronization of the buffer named wts upon entry to the kernel. You must call the buffer port synchronization member functions inside the kernel code, as shown below, before accessing the buffer port through read/write iterators or references.
void simple(adf::input_buffer<uint8>& ifm,
            adf::input_async_buffer<uint8>& wts,
            adf::output_async_buffer<uint8>& ofm)
{
    ...
    wts.acquire();       // acquire lock unconditionally inside the kernel
    if (<somecondition>) {
        ofm.acquire();   // acquire output buffer conditionally
    }
    ...                  // do some computation
    wts.release();       // release input buffer port inside the kernel
    if (<somecondition>) {
        ofm.release();   // release output buffer port conditionally
    }
    ...
}
The acquire() member function of the buffer object wts performs the synchronization and initialization needed to make the buffer port object available for read or write. It tracks internally which buffer pointers and locks to acquire, even if the buffer port is shared across AI Engine processors and may be double buffered. It can be called unconditionally or conditionally under dynamic control, and it is potentially a blocking operation. It is your responsibility to ensure that the corresponding release() member function is executed later (possibly in a subsequent kernel call) to release the lock associated with that buffer object. Incorrect synchronization can lead to a deadlock in your code.
In the following example, the kernel located in tile 1 requests a lock acquisition (write access) three times per run. The kernel located in tile 2 requests a lock acquisition (read access) twice per run.
Lock acquisition and release happen only inside the kernels. The main function does not handle buffer synchronization; buffer synchronization is your responsibility. The kernel in tile 1 requests access to the ping-pong buffer three times per run, and the kernel in tile 2 only twice. To balance the number of accesses, tile 1 should run twice and tile 2 should run three times per iteration.
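The run counts above follow from a least-common-multiple argument: both tiles must perform the same total number of buffer acquisitions per iteration. A minimal sketch of that arithmetic in plain host-side C++17 (not AI Engine kernel code) is shown here. The RunCounts struct and the balance_runs function are hypothetical names introduced for illustration and are not part of the ADF API.

```cpp
#include <cassert>
#include <numeric>  // std::lcm (C++17)

// Hypothetical helper types, for illustration only (not part of the ADF API).
struct RunCounts {
    int producer_runs;  // runs of the writing kernel per iteration
    int consumer_runs;  // runs of the reading kernel per iteration
};

// Given how many times each kernel acquires the shared buffer in one run,
// return the smallest run counts that equalize the total acquisitions.
inline RunCounts balance_runs(int producer_acquires_per_run,
                              int consumer_acquires_per_run) {
    assert(producer_acquires_per_run > 0 && consumer_acquires_per_run > 0);
    const int total = std::lcm(producer_acquires_per_run,
                               consumer_acquires_per_run);
    return {total / producer_acquires_per_run,
            total / consumer_acquires_per_run};
}
```

With three acquisitions per run on tile 1 and two on tile 2, balance_runs(3, 2) yields two runs for tile 1 and three runs for tile 2, that is, six acquisitions on each side.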
As seen in the figure, lock acquisition alternates between the ping buffer and the pong buffer. The buffer choice is automatic; no user decision is needed at this point.
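The alternation can be illustrated with a small host-side model in plain C++ (not kernel code). It assumes that acquisitions on one side of the buffer are counted continuously across runs and that the first one is served by the ping buffer. The Half enum and both functions are hypothetical names introduced for this sketch.

```cpp
#include <cassert>

enum class Half { Ping, Pong };

// Model assumption: acquisitions on one side of the buffer are numbered
// 0, 1, 2, ... across runs, and acquisition 0 is served by the ping buffer.
inline Half half_for_acquire(int n) {
    assert(n >= 0);
    return (n % 2 == 0) ? Half::Ping : Half::Pong;
}

// Global index of the k-th acquisition (0-based) in a given run (0-based).
inline int acquire_index(int run, int acquires_per_run, int k) {
    assert(run >= 0 && acquires_per_run > 0 && k >= 0 && k < acquires_per_run);
    return run * acquires_per_run + k;
}
```

Under these assumptions, a kernel that acquires three times per run sees ping, pong, ping in its first run and pong, ping, pong in its second, without any buffer selection in the kernel code.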
The minimum latency for lock acquisition is seven clock cycles, during which the kernel is stalled. If the buffer is not available for acquisition, the kernel is stalled for longer (as indicated in red in the figure) until the buffer becomes available. Depending on the application, there can be time intervals in which the ping buffer, the pong buffer, or both are not locked at all.
For an asynchronous buffer port, the kernel acquires and releases the buffer port explicitly with the acquire and release APIs. The asynchronous output buffer can be released at any time inside the kernel with the release API, no matter how many samples the kernel has written into the buffer. After the port is released, the asynchronous output buffer can be acquired by its consumer kernel or transferred by DMA to its destination, such as PLIO.
Consider a system with one producer AI Engine kernel and one consumer AI Engine kernel, communicating via asynchronous buffers. Initially, there are two empty buffers between the producer and the consumer.
From the producer's perspective:
Each time the producer wants to write data to a buffer, it must first call the acquire API. Once acquired, the buffer is owned by the producer, which can read from or write to it as needed. After finishing the operation, either in the same iteration or later, it must call the release API to release the buffer. Once released, the buffer becomes available to the consumer, increasing the count of full buffers. If both buffers are full, any subsequent acquire call by the producer blocks until an empty buffer becomes available.
From the consumer's perspective:
The consumer must also call the acquire API before accessing a buffer. After acquiring, it owns the buffer and can read from or write to it. Once finished, it calls the release API to release the buffer, making it available to the producer again and increasing the count of empty buffers. If both buffers are empty, the consumer stalls when it tries to acquire a buffer, until one becomes full.
In this system, PLIO or GMIO can also act as producers or consumers. Data exchange between PLIO/GMIO and the AI Engine is managed by DMA, which handles buffer availability transparently. Data can only be sent or received when the corresponding buffer is ready (that is, empty for writing or full for reading).
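The producer and consumer rules described above can be modeled as two counters, one for empty buffers and one for full buffers. The following is a minimal host-side sketch in plain C++, not the ADF API and not a model of the lock hardware. DoubleBufferModel and its member functions are hypothetical names; a try-acquire that returns false stands in for the point where the real acquire call would block, and the model assumes each side holds at most one buffer at a time.

```cpp
#include <cassert>

// Host-side model only: mimics the ownership rules of two buffers shared by
// one producer and one consumer. Names are hypothetical, not ADF calls.
class DoubleBufferModel {
public:
    // Producer acquire: needs an empty buffer. A false return marks the
    // situation in which the real acquire would block.
    bool producer_try_acquire() {
        if (producer_holds_ || empty_ == 0) return false;
        --empty_;
        producer_holds_ = true;
        return true;
    }
    // Producer release: the held buffer becomes full (visible to consumer).
    void producer_release() {
        assert(producer_holds_);
        producer_holds_ = false;
        ++full_;
    }
    // Consumer acquire: needs a full buffer.
    bool consumer_try_acquire() {
        if (consumer_holds_ || full_ == 0) return false;
        --full_;
        consumer_holds_ = true;
        return true;
    }
    // Consumer release: the held buffer becomes empty (visible to producer).
    void consumer_release() {
        assert(consumer_holds_);
        consumer_holds_ = false;
        ++empty_;
    }
    int empty_count() const { return empty_; }
    int full_count() const { return full_; }

private:
    int empty_ = 2;  // initially there are two empty buffers
    int full_ = 0;
    bool producer_holds_ = false;  // assumption: one buffer held per side
    bool consumer_holds_ = false;
};
```

In this model, a consumer that tries to acquire before the producer has released anything is refused, and a producer that has filled both buffers is refused until the consumer releases one. These are the two stall conditions described above.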