AI Engine Throughput and Latency - 2023.2 English

Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504)

Document ID: UG1504
Release Date: 2023-11-15
Version: 2023.2 English

In most applications, high throughput with low latency is desirable. Versal adaptive SoCs can help you achieve this goal, especially with the AI Engine. However, an application usually has a more stringent requirement for either high throughput or low latency, and it might be necessary to make some trade-offs to meet the more stringent requirement.

For example, in the case of FIR filters, it is possible to use the cascade stream within the AI Engine tile to increase the overall throughput by splitting the compute across multiple AI Engine tiles. However, this might increase the overall latency of the FIR. For more information, see Simple Filter Chain Example.
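The trade-off described above can be illustrated with a toy model. The following Python sketch is not derived from AI Engine hardware data: the function name `fir_cascade_model` and the default values for `macs_per_cycle` and `hop_cycles` are hypothetical placeholders chosen only to show the shape of the trade-off. It assumes the taps are divided evenly across the tiles and that each additional tile in the cascade adds a fixed number of cycles of latency.

```python
import math


def fir_cascade_model(num_taps, num_tiles, macs_per_cycle=8, hop_cycles=2):
    """Toy model of splitting a FIR across cascaded tiles.

    macs_per_cycle and hop_cycles are hypothetical placeholders,
    not measured AI Engine figures.
    Returns (cycles_per_output_sample, added_cascade_latency_cycles).
    """
    if num_taps < 1 or num_tiles < 1:
        raise ValueError("num_taps and num_tiles must be positive")
    # Each tile handles an (approximately) equal share of the taps.
    taps_per_tile = math.ceil(num_taps / num_tiles)
    # Fewer taps per tile means fewer cycles per output sample (higher throughput).
    cycles_per_sample = math.ceil(taps_per_tile / macs_per_cycle)
    # Each extra tile adds one cascade hop to the path of a sample (higher latency).
    added_latency = (num_tiles - 1) * hop_cycles
    return cycles_per_sample, added_latency


if __name__ == "__main__":
    for tiles in (1, 2, 4):
        cycles, latency = fir_cascade_model(num_taps=64, num_tiles=tiles)
        print(f"tiles={tiles}: cycles/sample={cycles}, added latency={latency} cycles")
```

In this model, adding tiles lowers the cycles needed per output sample while raising the cascade latency. Real figures depend on the data type, the kernel implementation, and the device, and should come from simulation of your own design.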

Data movement into, out of, and around the AI Engine array contributes directly to the latency of your system. One example is the choice between window and streaming interfaces for your kernel. With a window interface, the AI Engine waits for the full window to be available before starting the compute, so loading the input introduces an additional delay that depends on the window size. With a streaming interface, the compute starts while the data is being delivered. Streaming interfaces are common for high data rate designs. However, they can require additional PL functionality to sort and align the data for streaming. For the windowing solution, ping-pong buffering can help mitigate the increased latency.
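The window-fill delay and the effect of ping-pong buffering can be estimated with simple arithmetic. The following Python sketch is a simplified model, not a description of the actual AI Engine scheduler: the function names are hypothetical, the input is assumed to arrive at a constant sample rate, and ping-pong buffering is modeled as a standard double buffer in which filling the next window overlaps with computing on the current one.

```python
def window_fill_latency_ns(window_samples, sample_rate_msps):
    """Time to receive a full window before compute can start, in ns."""
    # samples / (Msamples/s) gives microseconds; scale by 1000 for nanoseconds.
    return window_samples * 1000.0 / sample_rate_msps


def window_period_ns(fill_ns, compute_ns, ping_pong):
    """Steady-state time per window in a simplified buffering model."""
    if ping_pong:
        # Fill and compute overlap, so the slower of the two sets the pace.
        return max(fill_ns, compute_ns)
    # With a single buffer, fill and compute alternate.
    return fill_ns + compute_ns


if __name__ == "__main__":
    fill = window_fill_latency_ns(window_samples=256, sample_rate_msps=1000.0)
    compute = 200.0  # hypothetical kernel compute time per window, in ns
    print("fill latency (ns):", fill)
    print("single buffer period (ns):", window_period_ns(fill, compute, False))
    print("ping-pong period (ns):", window_period_ns(fill, compute, True))
```

In this model the first window still pays the full fill latency. Ping-pong buffering hides the fill time of later windows behind the compute, as long as the compute time does not exceed the fill time.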

When communicating between the PL and the AI Engine array, be aware of the additional latency in your system if the PL is running more slowly than the AI Engine. In this case, you must use wider buses (32, 64, or 128 bits) to bring the data into the array. This adds latency to the overall system because the AXI4-Stream interconnect to the AI Engine memory and core is 32 bits wide, so consuming one 128-bit word takes 4 cycles.
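The cycle count quoted above follows from dividing the bus width by the 32-bit stream width. A minimal Python sketch of that arithmetic is shown here. The function name `cycles_to_consume` is hypothetical, and the model counts only the cycles needed to serialize one bus word, ignoring any clock-domain crossing or FIFO delays.

```python
STREAM_WIDTH_BITS = 32  # AXI4-Stream width to the AI Engine memory and core


def cycles_to_consume(bus_width_bits):
    """AI Engine stream cycles needed to consume one PL bus word."""
    if bus_width_bits <= 0 or bus_width_bits % STREAM_WIDTH_BITS != 0:
        raise ValueError("bus width must be a positive multiple of 32 bits")
    return bus_width_bits // STREAM_WIDTH_BITS


if __name__ == "__main__":
    for width in (32, 64, 128):
        print(f"{width}-bit bus: {cycles_to_consume(width)} cycle(s)")
```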

Tip: Use 64-bit PLIO at 500 MHz for best bandwidth utilization when feeding the AI Engine tiles in a -1 device where the AI Engine is running at 1 GHz.
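The reasoning behind this tip can be checked by comparing raw bandwidths: a 64-bit interface at 500 MHz carries the same number of bits per second as the 32-bit stream at 1 GHz. The following Python sketch performs that comparison. The function names are hypothetical, and the model considers only width multiplied by clock rate, ignoring protocol overhead and backpressure.

```python
def bandwidth_gbps(width_bits, clock_mhz):
    """Raw bandwidth in Gb/s for a bus of the given width and clock."""
    return width_bits * clock_mhz / 1000.0


def stream_utilization(plio_width_bits, pl_clock_mhz, aie_clock_mhz=1000.0):
    """Fraction of the 32-bit AI Engine stream bandwidth that the PL side can fill."""
    pl_bw = bandwidth_gbps(plio_width_bits, pl_clock_mhz)
    aie_bw = bandwidth_gbps(32, aie_clock_mhz)
    # The stream cannot carry more than its own capacity.
    return min(1.0, pl_bw / aie_bw)


if __name__ == "__main__":
    for width in (32, 64, 128):
        util = stream_utilization(width, pl_clock_mhz=500.0)
        print(f"{width}-bit PLIO at 500 MHz: {util:.0%} of stream bandwidth")
```

Under these assumptions, a 32-bit PLIO at 500 MHz fills only half of the stream bandwidth, while a 64-bit PLIO at 500 MHz matches it exactly.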

For data processing in the PL, consider using the DSP Engines and PL for small, latency-critical functions. This eliminates the need to move the data into and out of the AI Engine array, which can be costly when the latency requirements are tight.

Another consideration is how you intend to control your function or application within the AI Engine array. Will you use the PS or PL to control the kernel functionality at run time? Can the PS meet the latency requirement, or do you need a PL controller? These are important factors that must be evaluated for your specific application.

For more information, see the following documents:

Note: For information on NoC latency, see the Versal Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).