Programming the AI Engine array requires a thorough understanding of the algorithm to be implemented, the capabilities of the AI Engines, and the overall data flow between individual functional units. The AI Engine array supports three levels of parallelism:
- SIMD: Through vector registers that allow multiple elements to be computed in parallel.
- Instruction level: Through the VLIW architecture that allows multiple instructions to be executed in a single clock cycle.
- Multicore: Through the AI Engine array, where many AI Engines (from fewer than ten to several hundred) can execute in parallel.
While most standard C/C++ code can be compiled for the AI Engine, the code might require substantial restructuring to achieve optimal performance on the AI Engine array. The power of an AI Engine is its ability, in each clock cycle, to execute a vector MAC operation, load two 256-bit vectors for the next operation, store a 256-bit vector from the previous operation, and increment a pointer or execute another scalar operation. To make use of the vector processor, the code needs to use AIE APIs or intrinsic functions and be structured for pipelined vector operations. The AI Engine compiler does not perform any automatic or pragma-based vectorization. The code must be rewritten to use SIMD intrinsic data types (for example, `v8int32`) and vector intrinsic functions (for example, `mac(…)`), and these must be executed within a pipelined loop to achieve optimal performance. The 32-bit scalar RISC processor provides an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to a limited amount of local memory, which means that large data sets need to be partitioned.
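For illustration, a minimal sketch of such a vectorized kernel is shown below, written with the higher-level AIE API rather than raw intrinsics. The kernel name, buffer size, and loop trip count are assumptions made for the example, not values prescribed by the source:

```cpp
// Hypothetical example: elementwise vector multiply using the AIE API.
// Each loop iteration loads two 256-bit vectors (8 x int32), multiplies
// them into wide accumulators, and stores a 256-bit result vector.
#include <adf.h>
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>

void vmul_kernel(adf::input_buffer<int32>& a,
                 adf::input_buffer<int32>& b,
                 adf::output_buffer<int32>& out)
{
    auto aIt   = aie::begin_vector<8>(a);    // 8-lane (256-bit) vector iterators
    auto bIt   = aie::begin_vector<8>(b);
    auto outIt = aie::begin_vector<8>(out);

    // Assumed buffer size: 256 int32 elements = 32 iterations of 8 lanes.
    for (unsigned i = 0; i < 32; ++i)
        chess_prepare_for_pipelining        // ask the compiler to software-pipeline the loop
    {
        aie::vector<int32, 8> va = *aIt++;   // 256-bit vector load
        aie::vector<int32, 8> vb = *bIt++;   // second 256-bit vector load
        auto acc = aie::mul(va, vb);         // vector multiply into accumulators
        *outIt++ = acc.to_vector<int32>(0);  // shift-round-saturate and 256-bit store
    }
}
```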
AI Engine kernels are functions that run on an AI Engine and form the fundamental building blocks of a data flow graph specification. The data flow graph is a modified Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions that take buffer or stream arguments for graph connectivity. Kernels can also have static data and runtime parameter arguments that can be either asynchronous or triggering. Each kernel should be defined in its own source file.
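As a sketch of what this can look like, the declarations and graph below use the adf buffer, stream, kernel, and PLIO types; the kernel names, data types, file names, and runtime ratio are illustrative assumptions:

```cpp
// --- kernels.h (hypothetical declarations) ---
// The buffer and stream arguments define the kernel's ports
// for graph connectivity.
#include <adf.h>

// Buffer-based kernel: operates on blocks of data in local memory.
void filter_kernel(adf::input_buffer<int32>& in,
                   adf::output_buffer<int32>& out);

// Stream-based kernel with a scalar runtime parameter (RTP).
void scale_kernel(input_stream<int32>* in,
                  output_stream<int32>* out,
                  int32 scale);

// --- graph.h (sketch) ---
// A minimal graph connecting filter_kernel between PLIO ports.
class SimpleGraph : public adf::graph {
public:
    adf::kernel      k;
    adf::input_plio  in;
    adf::output_plio out;

    SimpleGraph() {
        k   = adf::kernel::create(filter_kernel);
        in  = adf::input_plio::create(adf::plio_32_bits, "data/input.txt");
        out = adf::output_plio::create(adf::plio_32_bits, "data/output.txt");
        adf::connect(in.out[0], k.in[0]);
        adf::connect(k.out[0], out.in[0]);
        adf::source(k) = "filter_kernel.cc";  // each kernel in its own source file
        adf::runtime<adf::ratio>(k) = 0.9;    // assumed fraction of one AI Engine's cycles
    }
};
```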
Achieving overall system performance requires additional reading and experience with the architecture and partitioning, as well as with AI Engine data flow graph generation and optimizing data flow connectivity. More detailed information can be found in the Versal Adaptive SoC AI Engine Architecture Manual (AM009) and the Versal Adaptive SoC AIE-ML Architecture Manual (AM020).
AMD provides DSP and communications libraries with optimized AI Engine code that should be used whenever possible. The supplied source code is also a great resource for learning about AI Engine kernel coding.