The SIMD VLIW AI Engine-ML comes as an array of processors interconnected by AXI-Stream interconnect blocks, as shown in the following figure:
Compared to the AI Engine found in the Versal™ AI Core devices, several differences can be seen at this level:
At the bottom of the processor array there are one or two rows (depending on the device) of 512 KB memories. These memories can be accessed by the PL and by the AI Engine-ML processors through the AXI-Stream interconnect network. The DMA channels of one memory block also have access to the neighboring memories. These memories are called ‘shared memories’ (a graph-level sketch follows this list)
AI Engine-ML tiles are all oriented the same way
Cascade stream is always left-to-right, but also top-to-bottom
Neighborhood structure no longer depends on the row index
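As an illustration of how such a shared memory can be used from an ADF graph, here is a minimal, hedged sketch assuming the adf::shared_buffer API from the Vitis tools; the buffer dimensions, port counts, PLIO names, file names, and tiling parameters are illustrative assumptions, not values taken from this document:

```cpp
#include <adf.h>
using namespace adf;

// Hypothetical graph routing PL data through one shared-memory tile.
class SharedMemGraph : public graph {
public:
    input_plio  in;
    output_plio out;
    shared_buffer<int32> mem;  // placed by the tools in the shared-memory row

    SharedMemGraph() {
        in  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        out = output_plio::create("DataOut", plio_32_bits, "data/output.txt");

        // 32x32 int32 buffer with one write port and one read port.
        mem = shared_buffer<int32>::create({32, 32}, 1, 1);

        // Access patterns executed by the tile's DMA channels.
        write_access(mem.in[0]) = tiling({.buffer_dimension = {32, 32},
                                          .tiling_dimension = {32, 32},
                                          .offset = {0, 0}});
        read_access(mem.out[0]) = tiling({.buffer_dimension = {32, 32},
                                          .tiling_dimension = {32, 32},
                                          .offset = {0, 0}});

        connect<> n0(in.out[0], mem.in[0]);
        connect<> n1(mem.out[0], out.in[0]);
    }
};
```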
Being intended for machine learning inference, these devices have been optimized for this kind of application:
The supported datatypes are:
(u)int4, (u)int8, (u)int16, bfloat16
The number of 8-bit x 8-bit multipliers has been doubled
Support for 4-bit x 8-bit multiplication (4x more than in the previous architecture)
bfloat16: 1 sign bit, 8-bit exponent, 7-bit mantissa -> keeps the dynamic range but with less mantissa precision than the standard float32 (SPFP); see the bit-layout sketch after this list.
The pipeline is optimized for tensor products
Permute blocks are no longer full crossbars but are limited to specific data selections (tensor products and convolutions)
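To make the bfloat16 trade-off mentioned above concrete, here is a small, self-contained C++ sketch (plain host code, not AI Engine code) that converts float32 to bfloat16 by simple truncation of the low 16 bits; real hardware typically rounds to nearest-even instead of truncating:

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// bfloat16 keeps the float32 sign bit and 8-bit exponent (hence the same
// dynamic range) but truncates the 23-bit mantissa down to 7 bits.
static uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));      // reinterpret the float32 bits
    return static_cast<uint16_t>(bits >> 16);  // sign + exponent + top 7 mantissa bits
}

static float bfloat16_to_float(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;  // low mantissa bits become 0
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159265f;
    uint16_t bf = float_to_bfloat16(x);
    // Prints 3.140625: the value survives with only ~2-3 decimal digits.
    std::printf("%f -> 0x%04x -> %f\n", x, (unsigned)bf, bfloat16_to_float(bf));
}
```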
AI Engine-ML processors now have access to their own registers: they can program the DMAs of their local memories.
Local memory is now 64 KB in size, always with eight 128-bit-wide banks.
Compute performance is doubled for 8-bit x 8-bit and 16-bit x 16-bit operations, and quadrupled for 4-bit x 8-bit.
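As a hedged illustration of how these multiply modes are exercised in a kernel, the following sketch uses the aie::mmul class from the AIE API (aie_api/aie.hpp) with a 4x8x4 int8 shape; the tile layout, shape, and output shift are assumptions chosen for the example rather than values from this document:

```cpp
#include <aie_api/aie.hpp>

// C (4x4) = sum over K/8 steps of A-tile (4x8) times B-tile (8x4), int8 data.
// A and B are assumed stored as sequences of contiguous row-major tiles.
void tile_matmul(const int8* __restrict A, const int8* __restrict B,
                 int8* __restrict C, unsigned K) {
    aie::mmul<4, 8, 4, int8, int8> m;        // M=4, K=8, N=4 per step

    for (unsigned k = 0; k < K; k += 8) {
        auto a = aie::load_v<32>(A + 4 * k); // next 4x8 tile (32 elements)
        auto b = aie::load_v<32>(B + 4 * k); // next 8x4 tile (32 elements)
        if (k == 0) m.mul(a, b);             // first tile initializes the accumulator
        else        m.mac(a, b);             // remaining tiles accumulate
    }
    // Shift-round the wide accumulator back down to int8.
    aie::store_v(C, m.to_vector<int8>(6));
}
```

This 8-bit mode is one of those whose multiplier count is doubled on AI Engine-ML; swapping the input types to int4/int8, where the API supports it, targets the quadrupled 4-bit x 8-bit mode.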