AI Engine-ML Memory - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Each AI Engine-ML has 16 KB of program memory, enough to store 1024 instructions at the maximum width of 128 bits. AI Engine-ML instructions are at most 128 bits wide. Multiple instruction formats, including variable-length instructions, are supported to reduce program memory usage. Many instructions outside of an optimized inner loop can use the shorter formats.
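The instruction count follows directly from the two sizes quoted above. A minimal arithmetic check, assuming 1 KB means 1024 bytes and that every instruction uses the full 128-bit format (the constant and function names are illustrative only):

```python
# Program memory capacity check (assumes 1 KB = 1024 bytes).
PROGRAM_MEMORY_BYTES = 16 * 1024
MAX_INSTRUCTION_BITS = 128


def max_width_instruction_slots(memory_bytes=PROGRAM_MEMORY_BYTES,
                                instruction_bits=MAX_INSTRUCTION_BITS):
    """Number of full-width instructions that fit in program memory."""
    return (memory_bytes * 8) // instruction_bits


print(max_width_instruction_slots())  # 1024
```

Because shorter formats exist, 1024 is best read as the count for code made up entirely of full-width instructions. A program that mixes in shorter instructions would occupy less space per instruction.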

Each AI Engine-ML tile has eight data memory banks, where each memory bank is a 512-word x 128-bit single-port memory (for a total of 64 KB). Each AI Engine-ML can access its own data memory in addition to those in its north, south, and west neighbors, for a total of 256 KB of data memory. The stack is placed in data memory. The default sizes for the stack and heap are 1 KB each. The compiler can automatically compute and adjust the heap size when the optimization level is greater than zero (xlopt >= 1 for the AI Engine compiler). Stack size and heap size can be changed using compiler options or constraints in the source code. If the tool-computed heap size (with xlopt >= 1) is greater than the explicitly specified value, compilation fails. Refer to the AI Engine Tools and Flows User Guide (UG1076) for more information about stack and heap size usage.

In a logical representation, the 256 KB memory can be viewed as one contiguous 256 KB block or four 64 KB blocks, and each block can be divided into four odd and four even banks. One even bank and one odd bank are interleaved to comprise a double bank. AI Engine-MLs on the edges of the AI Engine-ML array have fewer neighbors and correspondingly less memory available.
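The capacity figures above can be reproduced with a few lines of arithmetic. The sketch below assumes 1 KB means 1024 bytes. The names are illustrative, and the `neighbor_count` parameter is a hypothetical way to model tiles that have fewer reachable neighbors:

```python
# Figures taken from the text; 1 KB is assumed to be 1024 bytes.
DATA_BANKS_PER_TILE = 8
DATA_WORDS_PER_BANK = 512
DATA_WORD_BITS = 128


def tile_data_memory_bytes():
    """Data memory local to one AI Engine-ML tile."""
    return DATA_BANKS_PER_TILE * DATA_WORDS_PER_BANK * DATA_WORD_BITS // 8


def accessible_data_memory_bytes(neighbor_count=3):
    """Local memory plus the memory of each reachable neighbor (0 to 3)."""
    if not 0 <= neighbor_count <= 3:
        raise ValueError("a tile reaches at most three neighbors")
    return (1 + neighbor_count) * tile_data_memory_bytes()


def double_banks_per_block():
    """Each double bank pairs one even bank with one odd bank."""
    return DATA_BANKS_PER_TILE // 2


print(tile_data_memory_bytes() // 1024)         # 64
print(accessible_data_memory_bytes() // 1024)   # 256
print(accessible_data_memory_bytes(2) // 1024)  # 192, a tile missing one neighbor
print(double_banks_per_block())                 # 4
```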

Each AI Engine-ML has three address generation units (AGUs), also referred to as ports. They are used for address generation for vector load/store operations. Each memory port operates in 256-bit vector register mode or 32-bit/16-bit/8-bit scalar register mode. The 256-bit port is created by pairing an even and an odd memory bank. The 8-bit and 16-bit stores are implemented as read-modify-write instructions because the minimum memory access granularity is 32 bits. All three ports can operate concurrently if each port accesses a different bank.
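The concurrency rule for the three ports amounts to a distinctness check on the banks they target. A minimal sketch, treating each bank as an opaque index and leaving out how an address maps to a bank (the function name is hypothetical and is not part of any AI Engine tool or API):

```python
def ports_can_run_concurrently(bank_per_port):
    """True if no two ports target the same bank in this cycle.

    bank_per_port: one bank index per active port (at most three ports).
    """
    if len(bank_per_port) > 3:
        raise ValueError("an AI Engine-ML has three memory ports")
    return len(set(bank_per_port)) == len(bank_per_port)


print(ports_can_run_concurrently([0, 1, 2]))  # True
print(ports_can_run_concurrently([0, 0, 2]))  # False, two ports hit bank 0
```

This is one reason buffers that are read and written in the same cycle are usually worth placing in different banks.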

Data stored in memory is in little endian format.
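To illustrate how a narrow store interacts with the 32-bit minimum access granularity and the little endian layout, the sketch below emulates an 8-bit store as a read-modify-write on a list of 32-bit words. It is a behavioral model only and says nothing about how the hardware implements the operation. The function name is hypothetical:

```python
def store_byte(memory_words, byte_address, value):
    """Emulate an 8-bit store on a memory whose smallest access is 32 bits."""
    word_index, byte_lane = divmod(byte_address, 4)
    shift = 8 * byte_lane                     # little endian: lane 0 is the least significant byte
    word = memory_words[word_index]           # read the whole 32-bit word
    word &= ~(0xFF << shift) & 0xFFFFFFFF     # modify: clear the target byte lane
    word |= (value & 0xFF) << shift           # modify: insert the new byte
    memory_words[word_index] = word           # write the whole word back
    return word


mem = [0x11223344]
store_byte(mem, 1, 0xAB)
print(hex(mem[0]))  # 0x1122ab44
```

Byte address 1 lands in the second-lowest byte of the word, which is what a little endian layout implies.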

Each AI Engine-ML has a DMA controller that is divided into two separate modules: S2MM, which stores stream data to memory (32-bit data), and MM2S, which writes the contents of the memory to a stream (32-bit data). Both S2MM and MM2S have two independent data channels.

AI Engine-ML Memory Tile

Each column of the AI Engine-ML array has one or two memory tiles depending on the device, located between that column and the AI Engine-ML interface tile.

Each AI Engine-ML memory tile has 512 KB of ECC-protected memory divided into 16 banks (each 128 bits wide and 2k words deep). Each bank allows one read or one write every cycle and can be accessed by nine read interfaces and nine write interfaces. Access to this memory is made through the AXI4-Stream network. Each AI Engine-ML memory tile has:

  • Six Memory to Stream DMAs (MM2S)
  • Six Stream to Memory DMAs (S2MM)
  • Access to East and West Memory Tiles, for a total of 1.5 MB addressable memory
  • 5D addressing (including iteration-offset)
  • Zero-padding insertion features on MM2S DMAs
  • Out-of-order packets and finish on TLAST features on S2MM DMAs
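The memory tile capacity and the 1.5 MB addressable total can be checked the same way. The sketch assumes 1 KB means 1024 bytes and 2k means 2048 words. The `has_west` and `has_east` parameters are hypothetical and model a memory tile at the edge of the array:

```python
MEM_TILE_BANKS = 16
MEM_TILE_WORDS_PER_BANK = 2 * 1024
MEM_TILE_WORD_BITS = 128


def mem_tile_bytes():
    """Capacity of a single memory tile."""
    return MEM_TILE_BANKS * MEM_TILE_WORDS_PER_BANK * MEM_TILE_WORD_BITS // 8


def mem_tile_addressable_bytes(has_west=True, has_east=True):
    """Own memory plus that of each neighboring memory tile that exists."""
    tiles = 1 + int(has_west) + int(has_east)
    return tiles * mem_tile_bytes()


print(mem_tile_bytes() // 1024)                      # 512
print(mem_tile_addressable_bytes() / (1024 * 1024))  # 1.5
```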

The memory in the memory tile supports configurable bank interleaving. The banks can be addressed in linear or interleaved mode. Data can also be accessed with the AXI-MM interface.
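The difference between the two modes can be pictured with a small model. The text does not state the exact mapping, so the sketch assumes that linear mode fills one bank completely before moving to the next, and that interleaved mode rotates consecutive 128-bit words across the 16 banks. Both the granularity and the function name are assumptions made for illustration:

```python
MT_BANKS = 16
MT_WORDS_PER_BANK = 2048
MT_WORD_BYTES = 16  # one 128-bit word


def bank_and_row(byte_address, interleaved):
    """Return (bank, row) for a byte address under the assumed mapping."""
    word = byte_address // MT_WORD_BYTES
    if not 0 <= word < MT_BANKS * MT_WORDS_PER_BANK:
        raise ValueError("address outside the 512 KB memory tile")
    if interleaved:
        # Assumed: consecutive words rotate across the banks.
        return word % MT_BANKS, word // MT_BANKS
    # Assumed: each bank holds one contiguous 32 KB range.
    return word // MT_WORDS_PER_BANK, word % MT_WORDS_PER_BANK


print(bank_and_row(16, interleaved=True))   # (1, 0)
print(bank_and_row(16, interleaved=False))  # (0, 1)
```

Under these assumptions, a sequential stream touches a different bank on every word in interleaved mode, while in linear mode it stays in one bank for 32 KB.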

Access to this memory is performed through the AXI4-Stream and AXI4 memory-mapped networks.

Data is received through the 6x S2MM DMAs, through 2x S2MM DMAs from the west and east memory tile neighbors, and from the AXI-MM interface. Data is sent through the 6x MM2S DMAs, through 2x MM2S DMAs to the west and east memory tile neighbors, and to the AXI-MM interface.

Local DMAs are programmed through 48 buffer descriptors and managed by 64 semaphore locks, each holding a 6-bit value.

Memory access is done through four-dimensional addressing, where the first three dimensions can be subject to zero-padding. An iteration and offset mechanism is also integrated, which makes the overall access equivalent to 5D addressing. For details on address generation, see Memory and DMA Programming.
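The combination of four dimensions, zero-padding, and iteration offset can be pictured as a nested loop. The sketch below is a behavioral model under stated assumptions: all parameter names are hypothetical, strides and offsets are in arbitrary word units, dimension 0 is taken to be the innermost dimension, and a padded position is represented by `None` rather than by a memory address. It does not reflect the actual buffer descriptor fields.

```python
from itertools import product


def dma_word_sequence(base, sizes, strides, pad_before=(0, 0, 0),
                      pad_after=(0, 0, 0), iterations=1, iteration_step=0):
    """Yield word addresses in transfer order; None marks an inserted zero.

    sizes, strides: four entries each, dimension 0 is the fastest varying.
    pad_before, pad_after: zeros added around dimensions 0 to 2 only.
    iterations, iteration_step: the 4D pattern is replayed `iterations`
    times, shifting the base by `iteration_step` each time.
    """
    pads_lo = tuple(pad_before) + (0,)  # dimension 3 is never padded
    pads_hi = tuple(pad_after) + (0,)
    extents = [pads_lo[d] + sizes[d] + pads_hi[d] for d in range(4)]
    for it in range(iterations):
        start = base + it * iteration_step
        # product() varies its last argument fastest, so list dimensions 3..0.
        for i3, i2, i1, i0 in product(*(range(extents[d]) for d in (3, 2, 1, 0))):
            idx = (i0, i1, i2, i3)
            data_idx = [idx[d] - pads_lo[d] for d in range(4)]
            if any(not 0 <= data_idx[d] < sizes[d] for d in range(4)):
                yield None  # position falls in a padded region
            else:
                yield start + sum(data_idx[d] * strides[d] for d in range(4))


# A 2x2 window with one zero before each row, replayed twice 8 words apart.
seq = list(dma_word_sequence(base=0, sizes=(2, 2, 1, 1), strides=(1, 4, 0, 0),
                             pad_before=(1, 0, 0), iterations=2, iteration_step=8))
print(seq)  # [None, 0, 1, None, 4, 5, None, 8, 9, None, 12, 13]
```

The outer iteration loop is what turns the four-dimensional pattern into the equivalent of a fifth dimension.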