To program the AI Engine array effectively, you must have a thorough understanding of the following:
- The algorithm you are implementing
- The capabilities of the AI Engines
- The overall data flow between individual functional units
The AI Engine array supports the following three levels of parallelism:
- SIMD: Compute multiple elements in parallel using vector registers.
- Instruction level: Execute multiple instructions in a single clock cycle using the VLIW architecture.
- Multicore: Many AI Engines across the array (from fewer than ten to several hundred) execute in parallel.
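As a conceptual illustration of the SIMD level, the plain-C++ sketch below models a single vector instruction that updates eight lanes at once. This is not AIE intrinsic code; the `Vec` type and `vadd` function are stand-ins for what a vector register and vector-add instruction do in hardware.

```cpp
#include <array>
#include <cstddef>

// Conceptual model of SIMD: one "vector add" updates 8 lanes at once.
// On the AI Engine this is a single instruction on vector registers;
// here the lane loop is plain C++ for illustration only.
constexpr std::size_t LANES = 8;
using Vec = std::array<int, LANES>;

Vec vadd(const Vec& a, const Vec& b) {
    Vec r{};
    for (std::size_t i = 0; i < LANES; ++i)  // all 8 lanes in one "instruction"
        r[i] = a[i] + b[i];
    return r;
}
```

The other two levels build on this: VLIW issues several such operations per cycle within one core, and the array level runs many cores, each executing its own loop over its own data partition.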
You can compile most standard C/C++ code for the AI Engine. However, the code can require restructuring for optimal performance on the AI Engine array.
The power of an AI Engine lies in its ability to do all of the following in each clock cycle:
- Execute a vector MAC operation
- Load two 256-bit vectors for the next operation
- Store a 256-bit vector from the previous operation
- Increment a pointer or execute another scalar operation
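The loop body of a vector dot product shows the pattern this enables. The sketch below is plain C++ (real code would use AIE vector registers and intrinsics): the comments mark which operation of each iteration maps to which issue slot; on the hardware, all of them execute in the same VLIW cycle, while here they simply run sequentially.

```cpp
#include <cstddef>

// Plain-C++ sketch of the per-cycle work in a pipelined MAC loop.
// On an AI Engine, the operations in the loop body issue together in
// one VLIW instruction; here they execute one after another.
long long dot(const int* a, const int* b, std::size_t n) {
    long long acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int x = a[i];                 // load operand 1 for the next operation
        int y = b[i];                 // load operand 2 for the next operation
        acc += (long long)x * y;      // multiply-accumulate (MAC)
        // the loop-index increment models the scalar pointer update
    }
    return acc;
}
```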
To use the vector processor, code must call AIE APIs or intrinsic functions and be structured for pipelined vector operations. The AI Engine compiler performs no automatic or pragma-based vectorization: you must rewrite the code to use SIMD data types (for example, v8int32) and vector intrinsic functions (for example, mac(…)), and place these operations inside a pipelined loop to achieve optimal performance. The 32-bit scalar RISC processor provides an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to a limited amount of local memory, so large data sets must be partitioned.
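One common way to handle the memory limit is to tile a large buffer into chunks that fit a per-core budget and process one tile at a time. The sketch below is plain C++, and the 8 KB `TILE_BYTES` budget is an assumed figure for illustration, not a device parameter; actual local memory capacity depends on the device generation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: split a large data set into tiles that fit a per-core memory
// budget. TILE_BYTES is an illustrative assumption, not a device query.
constexpr std::size_t TILE_BYTES = 8192;
constexpr std::size_t TILE_ELEMS = TILE_BYTES / sizeof(float);

std::size_t num_tiles(std::size_t total_elems) {
    return (total_elems + TILE_ELEMS - 1) / TILE_ELEMS;  // ceiling division
}

// Process one tile at a time; each pass touches at most TILE_BYTES of data.
float sum_tiled(const std::vector<float>& data) {
    float total = 0.0f;
    for (std::size_t t = 0; t < num_tiles(data.size()); ++t) {
        std::size_t begin = t * TILE_ELEMS;
        std::size_t end = std::min(begin + TILE_ELEMS, data.size());
        for (std::size_t i = begin; i < end; ++i)
            total += data[i];
    }
    return total;
}
```

In a real design, each tile would be streamed into an AI Engine's local memory (for example, through the data flow graph connections described below) rather than indexed out of one large host-side array.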
AI Engine kernels are functions that run on an AI Engine, and form the fundamental building blocks of a data flow graph specification. The data flow graph is a modified Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions that take buffer or stream arguments for graph connectivity. Kernels can also have static data and runtime parameter arguments that can be either asynchronous or triggering. Define each kernel in its own source file.
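The overall shape of a kernel can be sketched as follows. This is a plain-C++ stand-in using raw pointers so it is self-contained; real AIE kernels take graph buffer or stream parameter types for connectivity, and `scale_kernel` and its `factor` parameter are hypothetical names for illustration.

```cpp
#include <cstddef>

// Plain-C++ stand-in for an AIE kernel: a void function whose arguments
// represent the kernel's graph connections. Real kernels use ADF buffer
// or stream parameter types instead of raw pointers, and a runtime
// parameter such as 'factor' would be declared as such in the graph.
void scale_kernel(const int* in, int* out, std::size_t n, int factor) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * factor;  // per-element work performed by the kernel
}
```

In a data flow graph, a function of this shape would be instantiated as a node, with its input and output arguments wired to other kernels or to I/O.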
Achieving overall system performance requires additional reading and experience in the following areas:
- Architecture
- Partitioning
- AI Engine data flow graph generation
- Optimizing data flow connectivity
For more detailed information, see the following documents:
- Versal Adaptive SoC AI Engine Architecture Manual (AM009)
- Versal Adaptive SoC AIE-ML Architecture Manual (AM020)
AMD provides DSP and communications libraries that include optimized code for the AI Engine. Use these libraries whenever possible. The supplied source code is also a great resource for learning about AI Engine kernel coding.