The SIMD VLIW AI Engine-ML comes as an array of processors interconnected by AXI-Stream interconnect blocks, as shown in the following figure:
Compared to the AI Engine found in the Versal™ AI Core devices, several differences can be seen at this level:
At the bottom of the processor array there are one or two rows (depending on the device) of 512 KB memories. These memories can be accessed by the PL and by the AI Engine-ML processors through the AXI-Stream interconnect network. The DMA channels of one memory block also have access to the neighboring memories. These memories are called ‘shared memories’ (a graph-level sketch follows this list)
AI Engine-ML tiles are all oriented the same way
Cascade stream is always left-to-right, but also top-to-bottom
Neighborhood structure no longer depends on the row index
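As an illustration of how such a shared memory can be used from an ADF graph, here is a minimal, hedged sketch assuming the adf::shared_buffer API from the Vitis tools; the buffer dimensions, port counts, PLIO names, file names, and tiling parameters are illustrative assumptions, not values taken from this document:

```cpp
#include <adf.h>
using namespace adf;

// Hypothetical graph routing PL data through one shared-memory tile.
class SharedMemGraph : public graph {
public:
    input_plio  in;
    output_plio out;
    shared_buffer<int32> mem;  // placed by the tools in the shared-memory row

    SharedMemGraph() {
        in  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        out = output_plio::create("DataOut", plio_32_bits, "data/output.txt");

        // 32x32 int32 buffer with one write port and one read port.
        mem = shared_buffer<int32>::create({32, 32}, 1, 1);

        // Access patterns executed by the tile's DMA channels.
        write_access(mem.in[0]) = tiling({.buffer_dimension = {32, 32},
                                          .tiling_dimension = {32, 32},
                                          .offset = {0, 0}});
        read_access(mem.out[0]) = tiling({.buffer_dimension = {32, 32},
                                          .tiling_dimension = {32, 32},
                                          .offset = {0, 0}});

        connect<> n0(in.out[0], mem.in[0]);
        connect<> n1(mem.out[0], out.in[0]);
    }
};
```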
Being intended for machine learning inference, these devices have been optimized for this kind of application:
The supported datatypes are:
(u)int4, (u)int8, (u)int16, bfloat16
The number of 8-bit x 8-bit multipliers has been doubled
Support for 4-bit x 8-bit multiplication (4x more than in the previous architecture)
bfloat16: 1 sign bit, 8-bit exponent, 7-bit mantissa -> keeps the dynamic range but with less mantissa precision than the standard float32 (SPFP); see the bit-layout sketch after this list.
The pipeline is optimized for tensor products
Permute blocks are no longer full crossbars but are limited to specific data selections (tensor products and convolutions)
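To make the bfloat16 trade-off mentioned above concrete, here is a small, self-contained C++ sketch (plain host code, not AI Engine code) that converts float32 to bfloat16 by simple truncation of the low 16 bits; real hardware typically rounds to nearest-even instead of truncating:

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// bfloat16 keeps the float32 sign bit and 8-bit exponent (hence the same
// dynamic range) but truncates the 23-bit mantissa down to 7 bits.
static uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));      // reinterpret the float32 bits
    return static_cast<uint16_t>(bits >> 16);  // sign + exponent + top 7 mantissa bits
}

static float bfloat16_to_float(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;  // low mantissa bits become 0
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159265f;
    uint16_t bf = float_to_bfloat16(x);
    // Prints 3.140625: the value survives with only ~2-3 decimal digits.
    std::printf("%f -> 0x%04x -> %f\n", x, (unsigned)bf, bfloat16_to_float(bf));
}
```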
AI Engine-ML processors now have access to their own registers: they can program the DMAs of their local memories.
Local memory is now 64 KB in size, always with eight 128-bit-wide banks.
Compute performance is doubled for 8-bit x 8-bit and 16-bit x 16-bit operations, and quadrupled for 4-bit x 8-bit.
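As a hedged illustration of how these multiply modes are exercised in a kernel, the following sketch uses the aie::mmul class from the AIE API (aie_api/aie.hpp) with a 4x8x4 int8 shape; the tile layout, shape, and output shift are assumptions chosen for the example rather than values from this document:

```cpp
#include <aie_api/aie.hpp>

// C (4x4) = sum over K/8 steps of A-tile (4x8) times B-tile (8x4), int8 data.
// A and B are assumed stored as sequences of contiguous row-major tiles.
void tile_matmul(const int8* __restrict A, const int8* __restrict B,
                 int8* __restrict C, unsigned K) {
    aie::mmul<4, 8, 4, int8, int8> m;        // M=4, K=8, N=4 per step

    for (unsigned k = 0; k < K; k += 8) {
        auto a = aie::load_v<32>(A + 4 * k); // next 4x8 tile (32 elements)
        auto b = aie::load_v<32>(B + 4 * k); // next 8x4 tile (32 elements)
        if (k == 0) m.mul(a, b);             // first tile initializes the accumulator
        else        m.mac(a, b);             // remaining tiles accumulate
    }
    // Shift-round the wide accumulator back down to int8.
    aie::store_v(C, m.to_vector<int8>(6));
}
```

This 8-bit mode is one of those whose multiplier count is doubled on AI Engine-ML; swapping the input types to int4/int8, where the API supports it, targets the quadrupled 4-bit x 8-bit mode.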