The AI Engine array is a 2D array of AI Engine tiles, where each tile contains an AI Engine, a memory module, and a tile interconnect module. An overview of such a 2D array of AI Engine tiles is shown in the following figure.
The memory module is shared with the tile's north, south, east, or west AI Engine neighbors, depending on the location of the tile within the array. An AI Engine can access its own memory module and the memory modules of its north, south, east, or west neighbors. The neighboring memory modules are accessed by the AI Engine through dedicated memory access interfaces, and each access can be at most 256 bits wide. An AI Engine can also send cascade stream data to, or receive it from, a neighboring AI Engine. The cascade stream is a one-way, horizontal stream that runs either left to right or right to left and wraps around when moving to the next row. The AXI4 interconnect module provides streaming connections between AI Engine tiles and provides stream-to-memory (S2MM) and memory-to-stream (MM2S) connections between the streaming interfaces and the memory module. In addition, each interconnect module is connected to its neighboring interconnect modules to provide flexible routing capability in a grid-like fashion.
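From a programming perspective, the cascade stream and the DMA-backed stream interfaces show up as different connection types in the ADF graph API. The following is a minimal sketch only: the kernel functions, source file paths, window sizes, and PLIO settings are hypothetical placeholders, not values prescribed by this section.

```cpp
#include <adf.h>
using namespace adf;

// Hypothetical kernel prototypes: stage0 produces partial results on the
// cascade stream, stage1 consumes them.
void stage0_kernel(input_window_int32 *in, output_stream_acc48 *cas_out);
void stage1_kernel(input_stream_acc48 *cas_in, output_window_int32 *out);

class cascade_graph : public graph {
public:
    kernel stage0, stage1;
    input_plio  data_in;
    output_plio data_out;

    cascade_graph() {
        stage0 = kernel::create(stage0_kernel);
        stage1 = kernel::create(stage1_kernel);
        source(stage0) = "src/stage0.cc";          // hypothetical source files
        source(stage1) = "src/stage1.cc";
        runtime<ratio>(stage0) = 0.8;
        runtime<ratio>(stage1) = 0.8;

        // Data enters and leaves the array over AXI4-Stream; the tile DMAs
        // (S2MM/MM2S) move it between the stream switch and data memory.
        data_in  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        data_out = output_plio::create("DataOut", plio_32_bits, "data/output.txt");
        connect<window<1024>>(data_in.out[0], stage0.in[0]);

        // Partial results pass directly to the neighboring AI Engine over
        // the one-way cascade stream.
        connect<cascade>(stage0.out[0], stage1.in[0]);

        connect<window<1024>>(stage1.out[0], data_out.in[0]);
    }
};
```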
The following illustration shows the architecture of a single AI Engine tile.
Each AI Engine tile has an AXI4-Stream switch that is a fully programmable 32-bit AXI4-Stream crossbar. It supports both circuit-switched and packet-switched streams with back-pressure. Through MM2S DMA and S2MM DMA, the AXI4-Stream switch provides stream access from and to AI Engine data memory. The switch also contains two 16-deep 33-bit (32-bit data + 1-bit TLAST) wide FIFOs, which can be chained to form a 32-deep FIFO by circuit-switching the output of one of the FIFOs to the other FIFO’s input.
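When a stream route needs extra buffering, the ADF graph API exposes a FIFO depth constraint that the tools can satisfy with the stream-switch FIFOs described above (for example, by chaining the two 16-deep FIFOs). The sketch below assumes two placeholder kernels connected by a named stream net; the names and depth value are illustrative, not mandated by this section.

```cpp
#include <adf.h>
using namespace adf;

// Hypothetical kernel prototypes connected by an AXI4-Stream route.
void producer(output_stream_int32 *out);
void consumer(input_stream_int32 *in);

class fifo_graph : public graph {
public:
    kernel p, c;

    fifo_graph() {
        p = kernel::create(producer);
        c = kernel::create(consumer);
        source(p) = "src/producer.cc";             // hypothetical source files
        source(c) = "src/consumer.cc";
        runtime<ratio>(p) = 0.5;
        runtime<ratio>(c) = 0.5;

        // Name the connection so a constraint can be attached, then request
        // 32 words of FIFO buffering on this stream route.
        connect<stream> net0(p.out[0], c.in[0]);
        fifo_depth(net0) = 32;
    }
};
```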
As shown in the following figure, the AI Engine is a highly optimized single-instruction, multiple-data (SIMD) and very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, a single store unit, and an instruction fetch and decode unit. One VLIW instruction can support a maximum of two loads, one store, one scalar operation, one fixed-point or floating-point vector operation, and two move instructions.
The AI Engine also has three address generator units (AGUs) to support multiple addressing modes. Two of the AGUs are dedicated to the two load units, and one AGU is dedicated to the store unit.
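A simple kernel written with the AIE vector API illustrates how this issue capability is used. In the sketch below (the function name, element type, and trip count are hypothetical), each loop iteration in the steady state needs two vector loads, one fixed-point vector multiply, and one vector store, which matches the two-load/one-store/one-vector-operation slots of a VLIW instruction; the pointer arithmetic is typically lowered to post-increment addressing handled by the AGUs.

```cpp
#include <aie_api/aie.hpp>

// Element-wise multiply of two int16 buffers, written with the AIE vector API.
// 'num' is assumed to be a multiple of 16.
void vmul_kernel(const int16 *__restrict a_ptr,
                 const int16 *__restrict b_ptr,
                 int16 *__restrict c_ptr,
                 unsigned num) {
    for (unsigned i = 0; i < num; i += 16) {
        aie::vector<int16, 16> a = aie::load_v<16>(a_ptr + i);   // load unit 1
        aie::vector<int16, 16> b = aie::load_v<16>(b_ptr + i);   // load unit 2
        aie::accum<acc48, 16> acc = aie::mul(a, b);              // vector unit
        aie::store_v(c_ptr + i, acc.to_vector<int16>(0));        // store unit
    }
}
```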
Additional details about the vector processing unit, AI Engine memory, and the AI Engine tile interface can be found in the following sections.