AI Engine Architecture Overview - 2025.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2025-11-20
Version: 2025.2 English

To program the AI Engine array effectively, you must have a thorough understanding of the following:

  • The algorithm you are implementing
  • The capabilities of the AI Engines
  • The overall data flow between individual functional units

The AI Engine array supports the following three levels of parallelism:

SIMD
Compute multiple elements in parallel using vector registers.
Instruction level
Execute multiple instructions in a single clock cycle using the VLIW architecture.
Multicore
Across the AI Engine array, where many AI Engines (from fewer than ten to several hundred) can execute in parallel.
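As an illustration of the SIMD level, a vector MAC operation updates multiple accumulator lanes with a single instruction. The following standalone C++ sketch is not AI Engine code; it models an 8-lane 32-bit MAC (the shape of a v8int32 operation) with an ordinary loop standing in for the parallel hardware lanes, which on the AI Engine all execute in one instruction:

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the 8 x 32-bit vector type; real code uses the
// intrinsic v8int32 or an AIE API vector type.
using v8int32 = std::array<int32_t, 8>;

// Multiply-accumulate across eight lanes. On the AI Engine hardware this is
// a single vector instruction; here the lane loop models the parallel lanes.
v8int32 mac8(v8int32 acc, const v8int32& a, const v8int32& b) {
    for (int lane = 0; lane < 8; ++lane)
        acc[lane] += a[lane] * b[lane];
    return acc;
}
```

The point of the model is that the per-lane work costs no extra cycles on the hardware: one MAC instruction advances all eight accumulators at once.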

You can compile most standard C/C++ code for the AI Engine. However, the code can require restructuring for optimal performance on the AI Engine array.

The power of an AI Engine is its ability to perform all of the following in a single clock cycle:

  • Execute a vector MAC operation
  • Load two 256-bit vectors for the next operation
  • Store a 256-bit vector from the previous operation
  • Increment a pointer or execute another scalar operation

To make use of the vector processor, the code must use AIE APIs or intrinsic functions and be structured for pipelined vector operations. The AI Engine compiler does not perform any automatic or pragma-based vectorization. The code must be rewritten to use SIMD intrinsic data types (for example, v8int32) and vector intrinsic functions (for example, mac(…)), and these data types and functions must be used within a pipelined loop to achieve optimal performance. The 32-bit scalar RISC processor provides an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to a limited amount of memory, so large data sets must be partitioned.
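For example, a scalar dot-product loop must be restructured so that each iteration performs vector loads and one 8-lane MAC, giving the compiler a loop body it can software-pipeline. The standalone C++ sketch below models that restructuring (v8int32 is mocked with std::array; actual kernels would use the intrinsic types or AIE API vector types, which this sketch only approximates):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using v8int32 = std::array<int32_t, 8>;  // stand-in for the intrinsic type

// Vector-style dot product; n must be a multiple of 8. Each outer-loop
// iteration models one pipelined step: two vector loads plus one 8-lane MAC.
int32_t dot_vec8(const int32_t* a, const int32_t* b, std::size_t n) {
    v8int32 acc{};  // eight partial accumulators, carried across iterations
    for (std::size_t i = 0; i < n; i += 8) {
        for (int lane = 0; lane < 8; ++lane)   // one vector MAC on hardware
            acc[lane] += a[i + lane] * b[i + lane];
    }
    int32_t sum = 0;  // final horizontal reduction of the eight lanes
    for (int lane = 0; lane < 8; ++lane)
        sum += acc[lane];
    return sum;
}
```

Keeping eight independent partial accumulators, rather than one scalar sum, is what lets the hardware sustain one vector MAC per cycle without a loop-carried dependency across lanes.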

AI Engine kernels are functions that run on an AI Engine, and form the fundamental building blocks of a data flow graph specification. The data flow graph is a modified Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions that take buffer or stream arguments for graph connectivity. Kernels can also have static data and runtime parameter arguments that can be either asynchronous or triggering. Define each kernel in its own source file.
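To illustrate the shape of a kernel declaration, the sketch below uses hypothetical mock buffer types so that the pattern can run standalone; real graph code takes the buffer (or stream) port classes supplied by the AI Engine tools, and the kernel body and type names here are placeholders, not the actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the kernel's buffer port types. In real code
// these are provided by the AI Engine tools and also define how the kernel
// connects into the data flow graph.
template <typename T> struct input_buffer  { std::vector<T> data; };
template <typename T> struct output_buffer { std::vector<T> data; };

// A kernel is declared as a void C/C++ function whose buffer arguments
// express graph connectivity. This placeholder kernel scales each sample.
void scale_kernel(input_buffer<int32_t>& in, output_buffer<int32_t>& out) {
    out.data.resize(in.data.size());
    for (std::size_t i = 0; i < in.data.size(); ++i)
        out.data[i] = in.data[i] * 2;
}
```

Because the function is void and all data passes through its port arguments, the graph compiler, not the kernel, owns buffer allocation and scheduling; this is what keeps the Kahn-process-network semantics deterministic.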

Achieving overall system performance requires additional reading and experience in the following areas:

  • Architecture
  • Partitioning
  • AI Engine data flow graph generation
  • Optimizing data flow connectivity

For more detailed information, see the following documents:

  • Versal Adaptive SoC AI Engine Architecture Manual (AM009)
  • Versal Adaptive SoC AIE-ML Architecture Manual (AM020)

AMD provides DSP and communications libraries that include optimized code for the AI Engine. Use these libraries whenever possible. The supplied source code is also a valuable resource for learning AI Engine kernel coding.