The following figure shows the top-level block diagram of the AI Engine tile, including its key building blocks and connectivity.
The AI Engine tile consists of the following high-level modules:
- Tile interconnect
- AI Engine
- AI Engine memory module
The tile interconnect module handles AXI4-Stream and memory-mapped AXI4 input/output traffic. The memory-mapped AXI4 and AXI4-Stream interconnects are described further in the following sections.
The AI Engine memory module has 32 KB of data memory divided into eight memory banks, a memory interface, DMA, and locks. Each memory module has a DMA for both the incoming and outgoing directions, and a Locks block. The AI Engine can access the memory modules in all four directions as one contiguous block of memory. The memory interface routes each memory access to the correct direction based on the address generated by the AI Engine.
The AI Engine has a scalar processor, a vector processor, three address generators, and 16 KB of program memory. It also has a cascade stream access for forwarding accumulator output to the next AI Engine tile. The AI Engine is described in more detail in AI Engine Architecture.
Both the AI Engine and the AI Engine memory module have control, debug, and trace units. Some of these units are described later in this chapter:
- Control and status registers
- Events, event broadcast, and event actions
- Performance counters for profiling and timers
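The memory module figures given above (32 KB per module, eight banks, four directions seen as one contiguous block) can be checked with a little arithmetic. The sketch below is illustrative only: the sizes come from the text, but the order in which the four directions appear in the contiguous address space (`ASSUMED_ORDER`) and the `decode` helper are hypothetical and are not taken from the hardware address map.

```python
KB = 1024

# Figures stated in the text.
MODULE_BYTES = 32 * KB       # data memory per AI Engine memory module
BANKS_PER_MODULE = 8         # memory banks per module
DIRECTIONS = 4               # modules visible to one AI Engine

# Derived figures.
BANK_BYTES = MODULE_BYTES // BANKS_PER_MODULE    # bytes per bank
ADDRESSABLE_BYTES = MODULE_BYTES * DIRECTIONS    # size of the contiguous view

# ASSUMPTION: this ordering is hypothetical and chosen only for illustration.
ASSUMED_ORDER = ("south", "west", "north", "east")


def decode(addr):
    """Split an address in the contiguous view into (direction, offset).

    Mimics the role of the memory interface, which selects a direction
    from the address generated by the AI Engine.
    """
    if not 0 <= addr < ADDRESSABLE_BYTES:
        raise ValueError("address is outside the contiguous data memory view")
    module_index, offset = divmod(addr, MODULE_BYTES)
    return ASSUMED_ORDER[module_index], offset
```

Under these assumptions each bank holds 4 KB and one AI Engine sees 128 KB of data memory in total.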
The following figure shows the AI Engine array with the AI Engine tiles and the dedicated interconnect units arrayed together. Sharing data through local memory between neighboring AI Engines is the main mechanism for data movement within the AI Engine array. Each AI Engine can access up to four memory modules:
- Its own
- The module on the north
- The module on the south
- The module on the east or west depending on the row and the relative placement of AI Engine and memory module
The AI Engines on the edges of the array have access to one or two fewer memory modules, following a checkerboard pattern.
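The access rules listed above can be modeled on a small grid to see why edge tiles lose one or two modules. This is a minimal sketch under stated assumptions: tiles are addressed as `(col, row)` with row 0 at the bottom, and the function name `accessible_modules` is hypothetical. In particular, which row parity places the memory module east of the AI Engine is an assumption made here; the text only states that the placement alternates from row to row.

```python
def accessible_modules(col, row, cols, rows):
    """Return {direction: (col, row)} for the memory modules that the
    AI Engine in tile (col, row) can reach in a cols x rows array."""
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError("tile is outside the array")

    # ASSUMPTION: on even rows the tile's own memory module sits east of
    # its AI Engine; on odd rows the placement is mirrored.
    own_is_east = (row % 2 == 0)

    modules = {}
    # The tile's own module is always reachable.
    modules["east" if own_is_east else "west"] = (col, row)

    # The module on the opposite side belongs to the horizontal neighbor.
    side_col = col - 1 if own_is_east else col + 1
    if 0 <= side_col < cols:
        modules["west" if own_is_east else "east"] = (side_col, row)

    # Modules of the tiles directly above and below, when they exist.
    if row + 1 < rows:
        modules["north"] = (col, row + 1)
    if row > 0:
        modules["south"] = (col, row - 1)
    return modules
```

In a 4 x 4 array modeled this way, an interior tile such as `(1, 1)` reaches four modules, the corner tile `(0, 0)` reaches two, and the opposite bottom corner `(3, 0)` reaches three, which reproduces the checkerboard pattern of edge tiles having one or two fewer modules.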
Together with the flexible and dedicated interconnects, the AI Engine array provides deterministic performance, low latency, and high bandwidth. The modular and scalable architecture allows more compute power as more tiles are added to the array.
The cascade streams travel horizontally from tile to tile, from the bottom row to the top. When a cascade stream reaches the edge at one end of a row, it is connected to the input of the tile above it. Therefore, the flow changes direction on alternate rows (west to east on one row, and east to west on the next). The cascading continues until it reaches one end of the top row, at which point the stream ends with no further connection. Because of the change in direction, the relative placement of the AI Engine and memory module in a tile is reversed from one row to the next.
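The serpentine path described above can be written out as a short traversal. The sketch below assumes tiles are addressed as `(col, row)` with row 0 at the bottom, and that the bottom row flows west to east; the text does not say which direction the bottom row takes, so that choice is an assumption. The function names `cascade_order` and `cascade_next` are hypothetical.

```python
def cascade_order(cols, rows):
    """List tiles in the order a cascade stream visits them."""
    if cols < 1 or rows < 1:
        raise ValueError("array must have at least one row and one column")
    order = []
    for row in range(rows):
        # ASSUMPTION: even rows flow west to east, odd rows east to west.
        if row % 2 == 0:
            columns = range(cols)
        else:
            columns = range(cols - 1, -1, -1)
        order.extend((col, row) for col in columns)
    return order


def cascade_next(col, row, cols, rows):
    """Return the tile fed by the cascade output of (col, row),
    or None for the last tile on the top row."""
    order = cascade_order(cols, rows)
    index = order.index((col, row))
    return order[index + 1] if index + 1 < len(order) else None
```

For a 3 x 2 array, `cascade_order(3, 2)` visits `(0, 0)`, `(1, 0)`, `(2, 0)`, then turns upward to `(2, 1)` and continues to `(1, 1)` and `(0, 1)`, where the stream ends.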