Performance Comparison Between AI Engine/PL and AI Engine/NoC Interfaces

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2022-10-19
Version: 2022.2 English

The AI Engine array interface consists of PL interface tiles and NoC interface tiles. These tiles manage the following two high-performance interfaces:

AI Engine to PL
AI Engine to NoC

The following image shows the AI Engine array interface structure.

Figure 1. AI Engine Array Interface Topology

One AI Engine to PL interface tile contains eight streams from the PL to the AI Engine and six streams from the AI Engine to the PL. The following table shows the capacity of one AI Engine to PL interface tile.

Table 1. AI Engine Array Interface to PL Interface Bandwidth Performance
Connection Type	Number of Connections	Data Width (bits)	Clock Domain	Bandwidth per Connection (GB/s)	Aggregate Bandwidth (GB/s)
PL to AI Engine array interface	8	64	PL (500 MHz)	4	32
AI Engine array interface to PL	6	64	PL (500 MHz)	4	24

Note: All bandwidth calculations in this section assume a nominal 1 GHz AI Engine clock for a -1L speed grade device at VCCINT = 0.70V with the PL interface running at half the frequency of the AI Engine as an example.
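The per-connection figures in Table 1 follow from the data width and the clock frequency. The following is a minimal sketch of that arithmetic, assuming one 64-bit word is transferred per PL clock cycle at the example 500 MHz clock. The function and constant names are illustrative only and are not part of any AMD tool or API.

```python
def stream_bandwidth_gb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth of one stream in GB/s, assuming one word per clock cycle."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


# Assumption from the note above: PL interface at half of a 1 GHz AI Engine clock.
PL_CLOCK_MHZ = 500
PL_STREAM_WIDTH_BITS = 64

per_connection = stream_bandwidth_gb_s(PL_STREAM_WIDTH_BITS, PL_CLOCK_MHZ)
pl_to_aie_per_tile = 8 * per_connection  # eight PL-to-AI Engine streams per tile
aie_to_pl_per_tile = 6 * per_connection  # six AI Engine-to-PL streams per tile

print(f"per connection: {per_connection} GB/s")
print(f"PL to AI Engine per tile: {pl_to_aie_per_tile} GB/s")
print(f"AI Engine to PL per tile: {aie_to_pl_per_tile} GB/s")
```

A different PL clock or PLIO width scales the result linearly, so the same function can be used to estimate other configurations.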

The exact number of PL and NoC interface tiles is device-specific. For example, in the VC1902 device, there are 50 columns of AI Engine array interface tiles. However, only 39 array interface tiles are available to the PL interface. Therefore, the aggregate bandwidth for the PL interface is approximately:

24 GB/s * 39 = 0.936 TB/s from AI Engine to PL
32 GB/s * 39 = 1.248 TB/s from PL to AI Engine
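The device-level totals above can be reproduced with a short calculation. This is a sketch only: the tile count and per-tile figures are the VC1902 example values from this section, and the names are hypothetical. For other devices, take the tile count from DS957.

```python
# Example values from this section (VC1902, 500 MHz PL clock).
PL_INTERFACE_TILES = 39
AIE_TO_PL_PER_TILE_GB_S = 24
PL_TO_AIE_PER_TILE_GB_S = 32


def aggregate_tb_s(per_tile_gb_s: float, tiles: int) -> float:
    """Aggregate peak bandwidth in TB/s (1 TB/s = 1000 GB/s)."""
    return per_tile_gb_s * tiles / 1000


aie_to_pl_total = aggregate_tb_s(AIE_TO_PL_PER_TILE_GB_S, PL_INTERFACE_TILES)
pl_to_aie_total = aggregate_tb_s(PL_TO_AIE_PER_TILE_GB_S, PL_INTERFACE_TILES)

print(f"AI Engine to PL: {aie_to_pl_total:.3f} TB/s")
print(f"PL to AI Engine: {pl_to_aie_total:.3f} TB/s")
```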

The number of array interface tiles available to the PL interface and the total bandwidth of the AI Engine to PL interface for other devices and speed grades are specified in Versal AI Core Series Data Sheet: DC and AC Switching Characteristics (DS957).

The input_gmio/output_gmio attribute uses the DMA in the AI Engine to NoC interface tile. The DMA has two 32-bit incoming streams from the AI Engine and two 32-bit outgoing streams to the AI Engine. In addition, it has one 128-bit memory-mapped AXI master interface to the NoC NMU. The performance of one AI Engine to NoC interface tile is shown in the following table.

Table 2. AI Engine to NoC Interface Tile Bandwidth Performance
Connection Type	Number of Connections	Bandwidth per Connection (GB/s)	Aggregate Bandwidth (GB/s)
AI Engine to DMA	2	4	8
DMA to NoC	1	16	16
DMA to AI Engine	2	4	8
NoC to DMA	1	16	16
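The figures in Table 2 are consistent with 32-bit streams and a 128-bit AXI interface both running at the nominal 1 GHz AI Engine clock. The sketch below reproduces them and shows which side of the DMA bounds the peak throughput of a tile in one direction. The one-word-per-cycle assumption and all names are illustrative, not taken from the tools.

```python
AIE_CLOCK_MHZ = 1000  # nominal 1 GHz AI Engine clock assumed in this section


def peak_gb_s(width_bits: int, clock_mhz: float, connections: int = 1) -> float:
    """Peak bandwidth in GB/s, assuming one word per cycle on every connection."""
    return connections * (width_bits / 8) * clock_mhz * 1e6 / 1e9


stream_side = peak_gb_s(32, AIE_CLOCK_MHZ, connections=2)  # AI Engine <-> DMA
noc_side = peak_gb_s(128, AIE_CLOCK_MHZ, connections=1)    # DMA <-> NoC NMU

# The slower of the two sides bounds the peak throughput per tile and direction.
tile_limit = min(stream_side, noc_side)

print(f"stream side: {stream_side} GB/s, NoC side: {noc_side} GB/s")
print(f"per-tile limit: {tile_limit} GB/s")
```

Under these assumptions the two 32-bit streams, not the 128-bit AXI interface, set the per-tile limit, which is why the device-level estimate for the NoC interface uses 8 GB/s per tile.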

The exact number of AI Engine to NoC interface tiles is device-specific. For example, in the VC1902 device, there are 16 AI Engine to NoC interface tiles. So, the aggregate bandwidth for the NoC interface is approximately:

8 GB/s * 16 = 128 GB/s from AI Engine to NoC
8 GB/s * 16 = 128 GB/s from NoC to AI Engine

When accessing DDR memory, the number of integrated DDR memory controllers (DDRMCs) in the platform limits DDR memory read and write performance. For example, if all four DDRMCs in a VC1902 device are fully used, the hard limit for DDR memory access is as follows.

3200 Mb/s * 64 bit * 4 DDRMCs / 8 = 102.4 GB/s
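The DDR limit above can be compared with the 128 GB/s aggregate of the AI Engine to NoC interface tiles to see which one bounds GMIO traffic to DDR memory. This is a rough sketch using the VC1902 example values from this section. The names are hypothetical, and the result is an upper bound only, because NoC routing, QoS, and access patterns reduce the achievable throughput further.

```python
# Example values from this section (VC1902).
DDR_RATE_MBIT_S_PER_PIN = 3200
DDR_BUS_WIDTH_BITS = 64
NUM_DDRMC = 4
NOC_INTERFACE_AGGREGATE_GB_S = 8 * 16  # 8 GB/s per tile, 16 tiles


def ddr_peak_gb_s(rate_mbit_s: float, width_bits: int, controllers: int) -> float:
    """Theoretical DDR peak in GB/s: rate * width * controllers / 8 bits per byte."""
    return rate_mbit_s * width_bits * controllers / 8 / 1000


ddr_limit = ddr_peak_gb_s(DDR_RATE_MBIT_S_PER_PIN, DDR_BUS_WIDTH_BITS, NUM_DDRMC)

# The smaller of the two figures is the upper bound for GMIO traffic to DDR.
gmio_ddr_upper_bound = min(ddr_limit, NOC_INTERFACE_AGGREGATE_GB_S)

print(f"DDR peak: {ddr_limit:.1f} GB/s")
print(f"upper bound for GMIO to DDR: {gmio_ddr_upper_bound:.1f} GB/s")
```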

The performance of input_gmio/output_gmio accessing DDR memory through the NoC is further restricted by the number of NoC lanes in the horizontal and vertical NoC, the inter-NoC configuration, and QoS settings. Note that DDR memory read and write efficiency is strongly affected by the access pattern and other overheads. For more information about the NoC, memory controller use, and performance numbers, see the Versal ACAP Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).

For a single connection to or from the AI Engine, both input_plio/output_plio and input_gmio/output_gmio have a hard bandwidth limit of 4 GB/s. Some advantages and disadvantages of choosing input_plio/output_plio or input_gmio/output_gmio are shown in the following table.

Table 3. Comparison of input_plio/output_plio vs input_gmio/output_gmio
	input_plio/output_plio	input_gmio/output_gmio
Advantages	More AI Engine to PL interface streams are available, hence larger aggregate bandwidth; no interference between different stream connections; supports packet switching	No PL resources required; no timing closure requirement
Disadvantages	Congestion risk if there are too many stream connections in one region of the device; timing closure required to achieve best performance	Fewer input_gmio/output_gmio ports available; lower aggregate bandwidth; multiple input_gmio/output_gmio ports compete for bandwidth
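Because each PLIO or GMIO connection is limited to 4 GB/s, a design that needs more bandwidth must spread its data across several connections. The following sketch estimates the minimum number of connections for a target bandwidth in the ideal case. The function name is hypothetical, and the estimate ignores port availability, congestion, and NoC or DDR limits, all of which can require more connections or make the target unreachable.

```python
import math

# Hard limit per PLIO or GMIO connection, as stated in this section.
PER_CONNECTION_LIMIT_GB_S = 4.0


def connections_needed(target_gb_s: float) -> int:
    """Minimum number of parallel connections for a target bandwidth (ideal case)."""
    if target_gb_s <= 0:
        raise ValueError("target bandwidth must be positive")
    return math.ceil(target_gb_s / PER_CONNECTION_LIMIT_GB_S)


# Example: a kernel pipeline that must ingest 10 GB/s needs at least 3 connections.
print(connections_needed(10.0))
```

With input_gmio/output_gmio, the required count should also be checked against the two streams per direction that each AI Engine to NoC interface tile provides.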