Programming the AI Engine array requires a thorough understanding of the algorithm to be implemented, the capabilities of the AI Engines, and the overall data flow between individual functional units. The AI Engine array supports three levels of parallelism:
- SIMD
  - Through vector registers that allow multiple elements to be computed in parallel.
- Instruction level
  - Through the VLIW architecture that allows multiple instructions to be executed in a single clock cycle.
- Multicore
  - Through the AI Engine array, where many hundreds of AI Engines can execute in parallel.
While most standard C code can be compiled for the AI Engine, the code might need substantial restructuring to achieve optimal performance on the AI Engine array. The power of an AI Engine is its ability to execute a vector MAC operation, load two 256-bit vectors for the next operation, store a 256-bit vector from the previous operation, and increment a pointer or execute another scalar operation in each clock cycle. The AI Engine compiler does not perform any automatic or pragma-based vectorization. The code must be rewritten to use SIMD intrinsic data types (for example, v8int32) and vector intrinsic functions (for example, mac(…)), and these must be executed within a pipelined loop to achieve optimal performance. The 32-bit scalar RISC processor provides an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to a limited amount of memory, which means that large data sets need to be partitioned.
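The kind of restructuring described above can be illustrated with a host-side C++ sketch. It does not use the AI Engine intrinsics: the Vec8 type and the mac8 helper are hypothetical stand-ins that only model an 8-lane, 32-bit vector and a lane-wise multiply-accumulate, so the example compiles with any standard C++ compiler. It assumes the input length is a multiple of eight.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of an 8-lane, 32-bit vector register.
// This is NOT the AI Engine v8int32 type; it only mimics the lane layout.
constexpr std::size_t kLanes = 8;
using Vec8 = std::array<std::int32_t, kLanes>;

// Hypothetical lane-wise multiply-accumulate: acc[i] += a[i] * b[i].
inline Vec8 mac8(Vec8 acc, const Vec8& a, const Vec8& b) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        acc[i] += a[i] * b[i];
    }
    return acc;
}

// Scalar reference: one multiply-accumulate per loop iteration.
std::int32_t dot_scalar(const std::vector<std::int32_t>& x,
                        const std::vector<std::int32_t>& y) {
    assert(x.size() == y.size());
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Restructured form: eight elements per iteration, accumulated per lane,
// followed by a single horizontal reduction after the loop.
std::int32_t dot_blocked(const std::vector<std::int32_t>& x,
                         const std::vector<std::int32_t>& y) {
    assert(x.size() == y.size());
    assert(x.size() % kLanes == 0);  // assumption: no remainder handling
    Vec8 acc{};                      // all lanes start at zero
    for (std::size_t i = 0; i < x.size(); i += kLanes) {
        Vec8 a{};
        Vec8 b{};
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            a[lane] = x[i + lane];   // models a vector load
            b[lane] = y[i + lane];
        }
        acc = mac8(acc, a, b);       // models one vector MAC
    }
    std::int32_t sum = 0;
    for (std::int32_t lane_sum : acc) {
        sum += lane_sum;
    }
    return sum;
}
```

Both functions return the same result for the same inputs. The blocked version keeps eight independent partial sums, which is the loop shape that maps onto vector lanes; on real hardware the inner copy loops and mac8 would be replaced by vector loads and an actual vector intrinsic.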
AI Engine kernels are functions that run on an AI Engine, and form the fundamental building blocks of a data flow graph specification. The data flow graph is a Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions that take window or stream arguments for graph connectivity. Kernels can also have static data and run-time parameter arguments that can be either asynchronous or triggering. Each kernel should be defined in its own source file.
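The determinism property of a Kahn process network can be sketched with a small host-side model. This is not the AI Engine graph API: Stream is a hypothetical stand-in built on std::queue, and scale_kernel, offset_kernel, and run_graph are illustrative names. Each kernel is a void function with one input and one output stream, and the two-kernel graph is run under two different firing orders.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for a stream connection between two kernels.
using Stream = std::queue<std::int32_t>;

// Kernel 1: one firing consumes one sample and emits it scaled by 3.
// A real KPN read would block on an empty input; here the firing is skipped.
void scale_kernel(Stream& in, Stream& out) {
    if (in.empty()) return;
    out.push(in.front() * 3);
    in.pop();
}

// Kernel 2: one firing consumes one sample and emits it plus 1.
void offset_kernel(Stream& in, Stream& out) {
    if (in.empty()) return;
    out.push(in.front() + 1);
    in.pop();
}

// Runs the graph src -> scale_kernel -> offset_kernel -> dst under one of
// two schedules and returns everything that arrived at dst.
std::vector<std::int32_t> run_graph(const std::vector<std::int32_t>& samples,
                                    bool interleaved) {
    Stream src, mid, dst;
    for (std::int32_t s : samples) src.push(s);

    if (interleaved) {
        // Schedule A: alternate single firings of the two kernels.
        while (!src.empty() || !mid.empty()) {
            scale_kernel(src, mid);
            offset_kernel(mid, dst);
        }
    } else {
        // Schedule B: run the first kernel to completion, then the second.
        while (!src.empty()) scale_kernel(src, mid);
        while (!mid.empty()) offset_kernel(mid, dst);
    }

    std::vector<std::int32_t> result;
    while (!dst.empty()) {
        result.push_back(dst.front());
        dst.pop();
    }
    return result;
}
```

Calling run_graph with the same samples and either schedule yields the same output sequence, because each kernel communicates only through its FIFO streams and its output depends only on the order of the data it reads, not on when it is fired.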
Achieving overall system performance requires additional reading and experience with the architecture and partitioning, as well as with AI Engine data flow graph generation and the optimization of data flow connectivity. The Versal Adaptive SoC AI Engine Architecture Manual (AM009) contains more detailed information.
AMD provides DSP and communications libraries with optimized code for the AI Engine that should be used whenever possible. The supplied source code is also a great resource for learning about AI Engine kernel coding.