Performance Comparison Between AI Engine/PL and AI Engine/NoC Interfaces - 2024.2 English - UG1603

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2024-11-28
Version: 2024.2 English

The AI Engine-ML array interface consists of PL interface tiles and NoC interface tiles. These tiles manage the following two high-performance interfaces:

  • AI Engine-ML to PL
  • AI Engine-ML to NoC

The following image shows the AI Engine-ML array interface structure.

Figure 1. Array Interface Topology

One AI Engine-ML to PL interface tile contains eight streams from the PL to the AI Engine-ML and six streams from the AI Engine-ML to the PL. The following table shows the capacity of one AI Engine-ML to PL interface tile.

Table 1. AI Engine-ML Array Interface to PL Interface Bandwidth Performance
| Connection Type | Number of Connections | Data Width (bits) | Clock Domain | Bandwidth per Connection (GB/s) | Aggregate Bandwidth (GB/s) |
|---|---|---|---|---|---|
| PL to AI Engine-ML array interface | 8 | 64 | PL (500 MHz) | 4 | 32 |
| AI Engine-ML array interface to PL | 6 | 64 | PL (500 MHz) | 4 | 24 |

Note: All bandwidth calculations in this section assume a nominal 1 GHz AI Engine-ML clock for a -1L speed grade device at VCCINT = 0.70V with the PL interface running at half the frequency of the AI Engine-ML as an example.
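
The per-connection and per-tile figures in Table 1 follow from the stream width and the PL clock. The following is a minimal sketch of that arithmetic, assuming the example 500 MHz PL clock from the note above and one 64-bit word transferred per PL clock cycle. The function and variable names are illustrative only and are not part of any AMD API.

```python
def stream_bandwidth_gbps(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth of one stream in GB/s, assuming one word per clock cycle."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * clock_mhz / 1000

# Assumption: example PL clock, half of the nominal 1 GHz AI Engine-ML clock.
PL_CLOCK_MHZ = 500

per_stream = stream_bandwidth_gbps(64, PL_CLOCK_MHZ)
pl_to_aie_per_tile = 8 * per_stream  # eight streams from the PL per interface tile
aie_to_pl_per_tile = 6 * per_stream  # six streams to the PL per interface tile

print(per_stream, pl_to_aie_per_tile, aie_to_pl_per_tile)
```

If the PL clock in a design is lower than the example value, the same function gives the correspondingly lower peak figure.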

The exact number of PL and NoC interface tiles is device-specific. For example, in the XCVE2802 device, there are 38 columns of AI Engine-ML array interface tiles. However, only 28 array interface tiles are available to the PL interface. Therefore, the aggregate bandwidth for the PL interface is approximately:

  • 24 GB/s * 28 = 0.672 TB/s from AI Engine-ML to PL
  • 32 GB/s * 28 = 0.896 TB/s from PL to AI Engine-ML
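
The device-level figures above scale the per-tile values by the number of tiles available to the PL. The following sketch repeats that calculation for the XCVE2802 example and also derives the total stream counts. The constant and function names are hypothetical; for other devices, the tile count must be taken from the data sheets.

```python
# Values taken from the XCVE2802 example in this section.
PL_TILES = 28                 # array interface tiles available to the PL
PL_TO_AIE_PER_TILE_GBPS = 32  # eight streams at 4 GB/s each
AIE_TO_PL_PER_TILE_GBPS = 24  # six streams at 4 GB/s each

def aggregate_tbps(per_tile_gbps: float, tiles: int) -> float:
    """Aggregate peak bandwidth in TB/s across all usable interface tiles."""
    return per_tile_gbps * tiles / 1000

aie_to_pl_total = aggregate_tbps(AIE_TO_PL_PER_TILE_GBPS, PL_TILES)
pl_to_aie_total = aggregate_tbps(PL_TO_AIE_PER_TILE_GBPS, PL_TILES)

# Total stream counts implied by the same tile count.
streams_pl_to_aie = PL_TILES * 8
streams_aie_to_pl = PL_TILES * 6

print(aie_to_pl_total, pl_to_aie_total, streams_pl_to_aie, streams_aie_to_pl)
```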

The number of array interface tiles available to the PL interface and the total bandwidth of the AI Engine-ML to PL interface for other devices and speed grades are specified in the following documents:
  • Versal AI Core Series Data Sheet: DC and AC Switching Characteristics (DS957)
  • Versal AI Edge Series Data Sheet: DC and AC Switching Characteristics (DS958)
  • Versal Architecture and Product Data Sheet: Overview (DS950)

The input_gmio/output_gmio attribute uses the DMA in the AI Engine-ML to NoC interface tile. The DMA has two 32-bit incoming streams from the AI Engine-ML and two 32-bit outgoing streams to the AI Engine-ML. In addition, it has one 128-bit memory-mapped AXI master interface to the NoC NMU. The performance of one AI Engine-ML to NoC interface tile is shown in the following table.

Table 2. AI Engine-ML to NoC Interface Tile Bandwidth Performance
| Connection Type | Number of Connections | Bandwidth per Connection (GB/s) | Aggregate Bandwidth (GB/s) |
|---|---|---|---|
| AI Engine-ML to DMA | 2 | 4 | 8 |
| DMA to NoC | 1 | 16 | 16 |
| DMA to AI Engine-ML | 2 | 4 | 8 |
| NoC to DMA | 1 | 16 | 16 |
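
The values in Table 2 can be reproduced from the interface widths. The following sketch assumes that both the 32-bit streams and the 128-bit memory-mapped AXI interface transfer one word per cycle at the nominal 1 GHz clock, an assumption that matches the table values. It then shows that, per tile and per direction, the stream side (8 GB/s) is the narrower of the two sides, not the AXI side (16 GB/s). All names are illustrative.

```python
AIE_CLOCK_GHZ = 1.0  # assumption: nominal AI Engine-ML clock used in this section

def link_gbps(width_bits: int, count: int, clock_ghz: float = AIE_CLOCK_GHZ) -> float:
    """Aggregate peak bandwidth in GB/s of `count` links, one word per cycle each."""
    return width_bits / 8 * clock_ghz * count

stream_side = link_gbps(32, 2)   # two 32-bit streams between AI Engine-ML and DMA
axi_side = link_gbps(128, 1)     # one 128-bit AXI master between DMA and NoC NMU

# Per direction, a tile cannot move more data than its narrower side allows.
tile_limit = min(stream_side, axi_side)

print(stream_side, axi_side, tile_limit)
```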

The exact number of AI Engine-ML to NoC interface tiles is device-specific. For example, in the XCVE2802 device, there are 12 AI Engine-ML to NoC interface tiles. So, the aggregate bandwidth for the NoC interface is approximately:

  • 8 GB/s * 12 = 96 GB/s from AI Engine-ML to DMA
  • 8 GB/s * 12 = 96 GB/s from DMA to AI Engine-ML
Note: AI Engine-ML array interface tiles have interfaces to both the PL and NoC. Routing to these interfaces shares the same streaming interconnect resources inside the AI Engine-ML array.
When accessing DDR memory, the number of integrated DDR memory controllers (DDRMCs) in the platform limits DDR memory read and write performance. For example, if all four DDRMCs in an XCVE2802 device are fully used with LPDDR4, the hard limit for DDR memory access is as follows.
  • 3733 Mb/s * 32 bit * 4 DDRMCs / 8 = 59.728 GB/s
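
The DDR limit above can be compared with the aggregate stream bandwidth of the NoC interface tiles (8 GB/s * 12 = 96 GB/s per direction on the XCVE2802). The following sketch performs that comparison under the stated example configuration. It only models peak values; function and variable names are hypothetical, and real throughput is lower because of access patterns, NoC routing, and QoS.

```python
def ddr_peak_gbps(rate_mbps_per_pin: float, bus_width_bits: int, controllers: int) -> float:
    """Peak DDR bandwidth in GB/s: data rate per pin times bus width times DDRMC count."""
    total_mbps = rate_mbps_per_pin * bus_width_bits * controllers
    return total_mbps / 8 / 1000  # bits to bytes, then MB/s to GB/s

# Example from this section: LPDDR4 at 3733 Mb/s, 32-bit interface, four DDRMCs.
ddr_limit = ddr_peak_gbps(3733, 32, 4)

# Aggregate stream bandwidth of the 12 NoC interface tiles, per direction.
gmio_stream_limit = 8 * 12

# When all traffic targets DDR memory, the smaller value is the upper bound.
upper_bound = min(ddr_limit, gmio_stream_limit)

print(ddr_limit, gmio_stream_limit, upper_bound)
```

In this example the DDR memory controllers, not the interface tiles, set the ceiling for input_gmio/output_gmio traffic that goes to DDR memory.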

The performance of input_gmio/output_gmio when accessing DDR memory through the NoC is further restricted by the number of NoC lanes in the horizontal and vertical NoC, inter-NoC configurations, and QoS. DDR memory read and write efficiency also depends heavily on the access pattern and other overheads. For more information about the NoC, memory controller use, and performance numbers, see the Versal Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).

For a single connection from the AI Engine-ML or to the AI Engine-ML, both input_plio/output_plio and input_gmio/output_gmio have a hard bandwidth limit of 4 GB/s. Some advantages and disadvantages for choosing input_plio/output_plio or input_gmio/output_gmio are shown in the following table.

Table 3. Comparison of input_plio/output_plio vs input_gmio/output_gmio
Advantages of input_plio/output_plio
  • More AI Engine-ML to PL interface streams are available, hence a larger aggregate bandwidth
  • No interference between different stream connections
  • Supports packet switching
Advantages of input_gmio/output_gmio
  • No PL resources required
  • No timing closure requirement
Disadvantages of input_plio/output_plio
  • Congestion risk if there are too many stream connections in a region of the device
  • Timing closure required for achieving best performance
Disadvantages of input_gmio/output_gmio
  • Fewer input_gmio/output_gmio ports available
  • Aggregate bandwidth is lower
  • Multiple input_gmio/output_gmio ports compete for bandwidth
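
As a rough planning aid, the 4 GB/s per-connection limit and the stream counts from the XCVE2802 example can be combined to estimate how many ports a required input bandwidth needs and which interface type has enough ports. The following is a sketch only: the names are hypothetical, the port counts are derived from the example numbers in this section (28 PL tiles with eight input streams each, 12 NoC tiles with two input streams each), and the result ignores congestion, timing closure, DDR limits, and NoC QoS.

```python
import math

PER_PORT_LIMIT_GBPS = 4  # hard limit of a single PLIO or GMIO connection

# Input streams into the AI Engine-ML array in the XCVE2802 example.
AVAILABLE_INPUT_PORTS = {
    "plio": 28 * 8,  # 28 PL interface tiles, eight PL-to-array streams each
    "gmio": 12 * 2,  # 12 NoC interface tiles, two DMA-to-array streams each
}

def ports_needed(required_gbps: float) -> int:
    """Minimum number of connections needed to carry the required peak bandwidth."""
    return math.ceil(required_gbps / PER_PORT_LIMIT_GBPS)

def feasible_interfaces(required_gbps: float) -> list[str]:
    """Interface types that have enough input ports for the required bandwidth."""
    needed = ports_needed(required_gbps)
    return [name for name, avail in AVAILABLE_INPUT_PORTS.items() if needed <= avail]

print(ports_needed(40), feasible_interfaces(40))
print(ports_needed(200), feasible_interfaces(200))
```

A requirement that passes this check can still be unachievable in practice, so the output is best read as an upper-bound filter before detailed design.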