The AIE-ML memory tile is introduced in the AIE-ML architecture to significantly increase the on-chip memory inside the AIE-ML array. In ML applications, this reduces the use of PL resources (LUTs, block RAMs, and URAMs). The memory tile is similar to the AIE-ML tile but has no AIE-ML processor or program memory. It contains 512 KB of high-density, high-bandwidth memory and an integrated DMA that accesses local memory and neighboring memories. The AIE-ML memory tile has only vertical streaming interfaces (no cascade or horizontal streams). A subset of DMA channels can directly access memory in the nearest neighboring memory tiles to the east and west. The following figure shows the AIE-ML memory tile architecture.
The memory tile has the following functional blocks. Each is either the same as or similar to the equivalent block in the AIE-ML tile:
- Memory
- DMA
- Locks
- AXI4-Stream switch
- Memory-mapped AXI4 switch
- Control, debug, and trace
- Events and event broadcast
The following is a list of AIE-ML memory tile features:
- Memory
  - 512 KB of ECC-protected memory arranged into 16 banks (each 128 bits wide and 2k words deep)
  - The memory banks in the AIE-ML memory tile initialize to zero at boot and reset
  - Supports up to 30 GB/s read and 30 GB/s write in parallel per memory tile
- DMA
  - Memory-to-stream DMA (MM2S) with six channels
    - 6 x 32-bit stream interfaces
    - 6 x 128-bit memory interfaces
    - 5D tensor address generation (including iteration-offset)
    - Supports compression and inserting zero padding into stream data
    - Access to memory and locks in east/west neighboring tiles (channels 0-3)
    - Supports task queue and task-complete-tokens; queue depth is four tasks per channel (see Task-Completion-Tokens for more information)
  - Stream-to-memory DMA (S2MM) with six channels
    - 6 x 32-bit stream interfaces
    - 6 x 128-bit memory interfaces
    - 5D tensor address generation (including iteration-offset)
    - Supports out-of-order packet transfer, finish-on-TLAST, and decompression
    - Access to memory and locks in east/west neighboring tiles (channels 0-3)
    - Supports task queue and task-complete-tokens; queue depth is four tasks per channel (see Task-Completion-Tokens for more information)
  - Buffer descriptors (BD)
    - 48 shared BDs
    - Each channel can access 24 BDs, and each BD can be accessed by six channels
- Stream Switch
  - Same design as in the AIE-ML tile, with 17 master and 18 slave ports
  - North and south ports, but no east or west streams
  - Trace and control ports
- Lock Module
  - 64 semaphore locks, each with a 6-bit unsigned lock state
  - Accessible from DMA channels in neighboring AIE-ML memory tiles
- Additional control and status registers
- Events, event actions, event broadcast, combo events
- Task-complete-tokens logic (see Task-Completion-Tokens for more information)
- Configuration/debug interconnect (memory-mapped AXI4)
  - 1 MB address space per tile
  - Write bandwidth improvement and stream control-packet support
- Debug and Trace
  - Similar to that in the AIE-ML tile
  - Event trace stream, four performance counters, and a 64-bit tile timer
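
Several figures in the feature list can be cross-checked with simple arithmetic. The following Python sketch is illustrative only and is not part of any AMD tool or API. It takes the numbers from the list above and makes three assumptions that the text does not state explicitly: "2k words" means 2048 words, 1 KB means 1024 bytes, and the 48 BDs are shared across all twelve DMA channels (six MM2S plus six S2MM). All names in the code are hypothetical.

```python
from typing import Tuple

# Figures taken from the feature list above.
NUM_BANKS = 16
BANK_WIDTH_BITS = 128
BANK_DEPTH_WORDS = 2 * 1024   # assumption: "2k words deep" means 2048 words

NUM_MM2S_CHANNELS = 6
NUM_S2MM_CHANNELS = 6
NUM_SHARED_BDS = 48
BDS_PER_CHANNEL = 24
CHANNELS_PER_BD = 6

LOCK_STATE_BITS = 6


def memory_capacity_bytes() -> int:
    """Total capacity = banks x bytes per word x words per bank."""
    return NUM_BANKS * (BANK_WIDTH_BITS // 8) * BANK_DEPTH_WORDS


def bd_sharing_is_consistent() -> bool:
    """Count channel-to-BD pairings from both sides; the totals must agree.

    Assumption: the BD pool is shared by all MM2S and S2MM channels.
    """
    channels = NUM_MM2S_CHANNELS + NUM_S2MM_CHANNELS
    return channels * BDS_PER_CHANNEL == NUM_SHARED_BDS * CHANNELS_PER_BD


def lock_state_range() -> Tuple[int, int]:
    """Smallest and largest value of an unsigned lock state."""
    return 0, (1 << LOCK_STATE_BITS) - 1


if __name__ == "__main__":
    print(memory_capacity_bytes() // 1024, "KB")        # 512 KB
    print("BD sharing consistent:", bd_sharing_is_consistent())
    print("Lock state range:", lock_state_range())      # (0, 63)
```

Under these assumptions, the bank geometry yields exactly 512 KB, the BD counts agree (12 x 24 = 48 x 6 = 288 channel-to-BD pairings), and a 6-bit unsigned lock state spans 0 to 63.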