AI Engine Tile Architecture

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID: UG1076
Release Date: 2020-11-24
Version: 2020.2 English

The AI Engine array consists of a 2D array of AI Engine tiles, where each AI Engine tile contains an AI Engine, a memory module, and a tile interconnect module. Figure 1 gives an overview of an AI Engine tile.

AI Engine: Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and a single store unit.
AI Engine Tile: An AI Engine tile contains an AI Engine and a local memory module, together with several communication paths that facilitate data exchange between tiles.
AI Engine Array: AI Engine array refers to the complete 2D array of AI Engine tiles.
AI Engine Program: The AI Engine program consists of a data-flow graph specification written in C/C++. This program is compiled and executed using the AI Engine tool chain.
AI Engine Kernels: Kernels are written in C/C++ using AI Engine vector data types and intrinsic functions. These are the computation functions running on an AI Engine. The kernels form the fundamental building blocks of a data-flow graph specification.

Figure 1. AI Engine Tile Block Diagram

The following illustration shows the architecture of a single AI Engine.

Figure 2. AI Engine

Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and one store unit. The main compute power is provided by the vector unit. The vector unit contains a fixed-point unit with 128 8-bit fixed-point multipliers and a floating-point unit with eight single-precision floating-point multipliers. The vector registers and permute network are shared between the floating-point and fixed-point vector units. The peak performance depends on the size of the data types used by the operands. The following table provides the number of MAC operations that can be performed by the vector processor per instruction.

Table 1. Supported Precision Bit Width of the Vector Datapath
X Operand	Z Operand	Output	Number of MACs
8 real	8 real	48 real	128
16 real	8 real	48 real	64
16 real	16 real	48 real	32
16 real	16 complex	48 complex	16
16 complex	16 real	48 complex	16
16 complex	16 complex	48 complex	8
16 real	32 real	48/80 real	16
16 real	32 complex	48/80 complex	8
16 complex	32 real	48/80 complex	8
16 complex	32 complex	48/80 complex	4
32 real	16 real	48/80 real	16
32 real	16 complex	48/80 complex	8
32 complex	16 real	48/80 complex	8
32 complex	16 complex	48/80 complex	4
32 real	32 real	80 real	8
32 real	32 complex	80 complex	4
32 complex	32 real	80 complex	4
32 complex	32 complex	80 complex	2
32 SPFP	32 SPFP	32 SPFP	8
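
The fixed-point rows of Table 1 follow a regular pattern: starting from 128 MACs for 8-bit real operands, each doubling of an operand width halves the count, and each complex operand halves it again. The following sketch encodes that observation. It is a reading of the table, not a rule stated in this guide, and `Operand` and `macs_per_instruction` are hypothetical names rather than AI Engine APIs. The result is only meaningful for operand combinations that actually appear in Table 1. The code assumes C++14 or later.

```cpp
#include <cstdint>

// Hypothetical operand description: element bit width and real/complex flag.
struct Operand {
    std::uint32_t bits;  // 8, 16, or 32
    bool is_complex;     // true for complex operands
};

// Hypothetical helper (not an AI Engine API): MACs per instruction for the
// fixed-point rows of Table 1, derived from the pattern observed in the table.
constexpr std::uint32_t macs_per_instruction(Operand x, Operand z) {
    // 8 x 8 real yields 128 MACs; the count scales inversely with the
    // product of the two operand widths.
    std::uint32_t macs = (128u * 8u * 8u) / (x.bits * z.bits);
    // Each complex operand halves the count again.
    if (x.is_complex) macs /= 2u;
    if (z.is_complex) macs /= 2u;
    return macs;
}
```

For example, `macs_per_instruction({16, true}, {32, false})` evaluates to 8, matching the "16 complex / 32 real" row.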

To calculate the maximum performance for a given datapath, multiply the number of MACs per instruction by the clock frequency of the AI Engine kernel. For example, with 16-bit input vectors X and Z, the vector processor can achieve 32 MACs per instruction. Using the clock frequency for the slowest speed grade results in:

32 MACs * 1 GHz clock frequency = 32 Giga MAC operations/second

In most cases, 32 MACs/instruction remains a theoretical upper bound because the algorithm to be implemented cannot continuously use the full capabilities of the AI Engine or might be constrained by I/O bandwidth.
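
The peak-rate calculation above, and the gap between peak and achieved throughput, can be sketched as follows. The function names are hypothetical, and the `utilization` parameter is an illustrative assumption: this guide gives no utilization figures, so any value passed in would have to come from profiling a specific kernel.

```cpp
// Hypothetical helper: theoretical peak in giga-MACs per second, assuming
// one vector MAC instruction is issued every clock cycle.
constexpr double peak_gmacs_per_second(unsigned macs_per_instruction,
                                       double clock_ghz) {
    return macs_per_instruction * clock_ghz;
}

// Hypothetical helper: scale the peak by the fraction of cycles (0.0 to 1.0)
// in which the kernel actually issues a full-width vector MAC.
// `utilization` is an assumed, measured input, not a documented value.
constexpr double sustained_gmacs_per_second(unsigned macs_per_instruction,
                                            double clock_ghz,
                                            double utilization) {
    return peak_gmacs_per_second(macs_per_instruction, clock_ghz) * utilization;
}
```

With the numbers from the text, `peak_gmacs_per_second(32, 1.0)` gives 32.0. A kernel that issued a vector MAC in only half of its cycles would reach `sustained_gmacs_per_second(32, 1.0, 0.5)`, or 16.0.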

The main I/O interfaces for moving compute data into and out of the AI Engine are the data memory interfaces, the stream interfaces, and the cascade stream interfaces. A complete list of interfaces, including the program memory interface and the debug interface, is available in Versal ACAP AI Engine Architecture Manual (AM009).

Recommended: Xilinx highly recommends reading Versal ACAP AI Engine Architecture Manual (AM009) prior to starting your AI Engine kernel programming.

The data memory interface sees one contiguous memory consisting of the data memory modules in all four directions, with a total capacity of 128 KB. The AI Engine has two 256-bit-wide load units and one 256-bit-wide store unit.
The AI Engine has two 32-bit input AXI4-Stream interfaces and two 32-bit output AXI4-Stream interfaces. Each of these streams allows the AI Engine to make a 128-bit access every four clock cycles or a 32-bit access every cycle.
The 384-bit accumulator data from one AI Engine can be forwarded to the neighboring AI Engine by using the cascade stream interfaces to form a chain. The cascade stream interface is unidirectional, and its direction depends on the row where the AI Engine is located. There is a small, two-deep, 384-bit-wide FIFO on both the input and output streams, which allows storing up to four values between AI Engines. Each cycle, 384 bits can be sent and received by the chained AI Engines.
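
To compare these interfaces, it can help to convert the per-cycle widths into bandwidth figures. The sketch below does this under two assumptions: every port transfers its full width on every cycle, and 1 GB means 10^9 bytes. The helper and constant names are hypothetical. The 1 GHz clock in the usage note is the slowest-speed-grade figure used earlier in this section.

```cpp
// Per-cycle widths taken from the interface descriptions above.
constexpr unsigned kLoadUnits = 2,     kLoadBits = 256;
constexpr unsigned kStoreUnits = 1,    kStoreBits = 256;
constexpr unsigned kStreamInPorts = 2, kStreamBits = 32;
constexpr unsigned kCascadePorts = 1,  kCascadeBits = 384;

// Hypothetical helper: aggregate bandwidth in GB/s (1 GB = 1e9 bytes) for
// `count` ports that each move `bits_per_cycle` bits on every clock cycle.
constexpr double gbytes_per_second(unsigned count, unsigned bits_per_cycle,
                                   double clock_ghz) {
    return count * (bits_per_cycle / 8.0) * clock_ghz;
}

// A 128-bit access every four cycles is the same rate as 32 bits per cycle.
static_assert(128 / 4 == kStreamBits, "stream rates should agree");

// Two-deep FIFOs on both the input and output side of a cascade link
// buffer 2 + 2 = 4 accumulator values between neighboring AI Engines.
constexpr unsigned kCascadeFifoDepth = 2;
constexpr unsigned cascade_buffered_values() { return 2 * kCascadeFifoDepth; }
```

At 1 GHz this gives 64 GB/s for the two load units combined, 32 GB/s for the store unit, 8 GB/s for the two input streams combined, and 48 GB/s for the cascade stream. These are upper bounds in the same sense as the MAC figures above.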

The program memory size on the AI Engine is 16 KB, which allows storing 1024 instructions of 128 bits each. AI Engine instructions are up to 128 bits wide and support multiple instruction formats, including variable-length instructions, to reduce the program memory footprint. Many instructions outside of the optimized inner loop can use the shorter formats.
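
The instruction-count figure can be checked with a short compile-time calculation. This is a minimal sketch that assumes 1 KB means 1024 bytes; the constant and function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kProgramMemoryBytes = 16 * 1024;  // 16 KB
constexpr std::size_t kMaxInstructionBits = 128;        // widest format

// Number of full-width instructions that fit in program memory. Shorter
// instruction formats would allow more instructions than this count.
constexpr std::size_t full_width_instruction_slots() {
    return kProgramMemoryBytes * 8 / kMaxInstructionBits;
}

static_assert(full_width_instruction_slots() == 1024,
              "16 KB holds 1024 instructions of 128 bits");
```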