On the AI Engine processor, data may be packetized into window buffers, which are mapped to local memory.
Window buffers can be accessed with 256-bit wide load/store operations, giving a throughput of up to 256 Gbps (based on a 1 GHz AIE clock).
In the case of FIRs, each window is extended by a margin so that the state of the filter at the end of the previous iteration of the window may be restored before new computations begin.
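The role of the margin can be illustrated with a small model. The sketch below is not the library implementation. It assumes a margin of exactly `len(taps) - 1` samples (the actual margin size may differ, for example for alignment reasons) and uses the hypothetical helper names `fir` and `fir_windowed`. It shows that filtering window by window, with the tail of the previous window carried over as the margin, gives the same result as filtering the whole stream at once.

```python
def fir(samples, taps):
    """Reference FIR over the whole stream: y[n] = sum_k taps[k] * x[n-k], x[n<0] = 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(taps):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def fir_windowed(samples, taps, window_size):
    """Filter one window at a time, restoring filter state from a margin.

    Assumption: the margin holds the last len(taps) - 1 input samples of the
    previous window (zeros before the first window).
    """
    margin_len = len(taps) - 1
    margin = [0] * margin_len
    out = []
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        extended = margin + window  # window extended by the margin
        # Compute one output per new sample; the margin supplies the history.
        for n in range(margin_len, len(extended)):
            out.append(sum(c * extended[n - k] for k, c in enumerate(taps)))
        # Keep the newest samples as the filter state for the next iteration.
        margin = extended[len(extended) - margin_len:]
    return out


taps = [1, 2, 3, 4]
samples = list(range(1, 33))
matches = fir_windowed(samples, taps, window_size=8) == fir(samples, taps)
```

Without the margin, the first `len(taps) - 1` outputs of every window would be computed from incomplete history and would differ from the continuous-stream result.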
Window buffers are implemented using a ping-pong mechanism: the consumer kernel reads the ping portion of the buffer while the producer fills the pong portion, which is consumed in the next iteration.
In each iteration run, the kernel operates on a set number of samples from the window buffer, defined by the template parameter TP_INPUT_WINDOW_VSIZE. To allow the kernel to safely operate on buffered data, a mechanism of lock acquires and releases is implemented.
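The ping-pong and lock behaviour described above can be modelled in software. The following is a conceptual sketch only: it uses Python threads and semaphores as stand-ins for the hardware locks, and the names `run_ping_pong`, `free`, and `full` are hypothetical and do not correspond to any AI Engine API. Each half of the buffer has one lock the producer must acquire before writing and one the consumer must acquire before reading.

```python
import threading


def run_ping_pong(num_iterations, window_vsize):
    """Model a producer and consumer sharing a two-half (ping/pong) buffer."""
    buffers = [[0] * window_vsize, [0] * window_vsize]  # ping and pong halves
    # free[h]: half h may be written; full[h]: half h holds data to be read.
    free = [threading.Semaphore(1), threading.Semaphore(1)]
    full = [threading.Semaphore(0), threading.Semaphore(0)]
    consumed = []

    def producer():
        for it in range(num_iterations):
            half = it % 2
            free[half].acquire()          # wait until the consumer is done with it
            for i in range(window_vsize):
                buffers[half][i] = it * window_vsize + i
            full[half].release()          # hand the filled half to the consumer

    def consumer():
        for it in range(num_iterations):
            half = it % 2
            full[half].acquire()          # wait until the producer has filled it
            consumed.extend(buffers[half])
            free[half].release()          # give the half back to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


result = run_ping_pong(num_iterations=6, window_vsize=8)
```

Because the producer can fill one half while the consumer reads the other, the two only wait on each other when one of them runs a full window ahead.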
Note
The window interface is not available in Super Sample Rate modes.
Maximizing Throughput
Buffer synchronization requirements introduce a fixed overhead when a kernel is triggered. Therefore, to maximize throughput, the window size should be set to the maximum that the system will allow.
For example, a 4 tap single-rate symmetric FIR with a 2560 sample input/output window operating on int32 data with int16 coefficients implemented on AIE can produce an output window buffer in 354 clock cycles, which, taking into account the kernel's startup overhead (around 40 clock cycles), equates to a throughput of close to 6500 MSa/s (based on a 1 GHz AIE clock).
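The arithmetic behind this figure can be checked with a short calculation. The sketch below uses the numbers quoted above (2560 samples, 354 cycles, about 40 cycles of overhead, 1 GHz clock). The second function is only a rough model: it assumes that compute cycles scale linearly with window size and that the overhead stays fixed, which is an assumption made here for illustration and not a statement from the library. The function names are hypothetical.

```python
def throughput_msa_per_s(window_samples, kernel_cycles, overhead_cycles, clock_ghz=1.0):
    """Samples per second (in MSa/s) for one window, including the fixed overhead."""
    total_cycles = kernel_cycles + overhead_cycles
    return window_samples / total_cycles * clock_ghz * 1000.0


def modelled_throughput(window_samples, cycles_per_sample, overhead_cycles, clock_ghz=1.0):
    """Assumed linear model: kernel_cycles = window_samples * cycles_per_sample."""
    kernel_cycles = window_samples * cycles_per_sample
    return throughput_msa_per_s(window_samples, kernel_cycles, overhead_cycles, clock_ghz)


# Figures from the example: 2560 samples in 354 cycles plus ~40 cycles of overhead.
example = throughput_msa_per_s(2560, 354, 40)

# Under the linear assumption, smaller windows amortize the overhead less well.
cycles_per_sample = 354 / 2560
sweep = [modelled_throughput(n, cycles_per_sample, 40) for n in (256, 512, 1024, 2560)]
```

Under this model the throughput rises with window size, because the fixed overhead is spread over more samples.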
Note
To achieve maximum performance, producer and consumer kernels should be placed in adjacent AIE tiles, so the window buffers can be accessed without a requirement for MM2S/S2MM DMA stream conversions.
Latency
Latency of a window-based FIR is predominantly due to the buffering in the input and output windows. Other factors that affect latency are data type and FIR length, though these tend to have a lesser effect.
For example, a 16 tap single-rate symmetric FIR with a 512 sample input/output window operating on cint16 data with int16 coefficients implemented on AIE will need around 2.56 us (based on a 1 GHz AIE clock) before the first full window of output samples is available for the consumer to read.
Subsequent iterations will produce output data with reduced latency, due to the nature of ping-pong buffering and pipelined operations.
To minimize the latency, the buffer size should be set to the minimum size that meets the required throughput.
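One way to reason about this trade-off is to search candidate window sizes for the smallest one that still meets a throughput target. The sketch below is illustrative only: it assumes a linear cost model (a fixed per-window overhead plus a constant number of cycles per sample), and the function name and all numeric values are hypothetical and not measured on hardware.

```python
def min_window_for_throughput(target_msa_per_s, cycles_per_sample, overhead_cycles,
                              candidates, clock_ghz=1.0):
    """Return the smallest candidate window size meeting the target, or None.

    Assumed model: cycles per window = n * cycles_per_sample + overhead_cycles.
    """
    for n in sorted(candidates):
        cycles = n * cycles_per_sample + overhead_cycles
        achieved = n / cycles * clock_ghz * 1000.0  # MSa/s
        if achieved >= target_msa_per_s:
            return n
    return None  # the target is not reachable with any candidate


# Hypothetical figures: 0.25 cycles per sample, 40 cycles of overhead per window.
sizes = [128, 256, 512, 1024, 2048]
chosen = min_window_for_throughput(3000, 0.25, 40, sizes)
unreachable = min_window_for_throughput(5000, 0.25, 40, sizes)
```

In this model the throughput can never exceed `1000 * clock_ghz / cycles_per_sample`, so a target above that bound returns `None` regardless of window size.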