Single Kernel Programming - 2024.1 English

AI Engine-ML Kernel and Graph Programming Guide (UG1603)

Document ID: UG1603
Release Date: 2024-06-06
Version: 2024.1 English

An AI Engine-ML kernel is a program written in C/C++ that uses the AI Engine API and specialized intrinsic functions to target the VLIW scalar and vector processors. The kernel code is compiled with the AI Engine compiler, which is included in the AMD Vitis™ core development kit. The compiler produces ELF files that run on the AI Engine-ML processors.

The AI Engine-ML supports specialized data types and API functions for vector processing. Restructuring scalar application code to use these API functions and vector data types can produce fast and efficient vectorized code. The AI Engine compiler handles the mapping of API functions to operations, vector and scalar register allocation, data movement, automatic scheduling, and the generation of microcode that is efficiently packed into VLIW instructions.
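The restructuring described above can be sketched in portable C++. This is an illustration only, not AI Engine API code: `Vec8`, `load8`, `store8`, and `add8` are hypothetical stand-ins for a vector data type and its load, store, and arithmetic functions, and the lane count of 8 is an arbitrary assumption. The point is the shape of the rewrite: the element-by-element loop becomes a loop over fixed-width chunks, each handled by one load, one operation, and one store.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-lane vector type standing in for a hardware vector register.
// It is NOT part of the AI Engine API; it only models "N lanes at once".
constexpr std::size_t kLanes = 8;
using Vec8 = std::array<std::int32_t, kLanes>;

// Hypothetical helper: load kLanes consecutive elements starting at p.
inline Vec8 load8(const std::int32_t* p) {
    Vec8 v{};
    for (std::size_t i = 0; i < kLanes; ++i) v[i] = p[i];
    return v;
}

// Hypothetical helper: store kLanes elements to consecutive addresses at p.
inline void store8(std::int32_t* p, const Vec8& v) {
    for (std::size_t i = 0; i < kLanes; ++i) p[i] = v[i];
}

// Hypothetical helper: lane-wise addition of two vectors.
inline Vec8 add8(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (std::size_t i = 0; i < kLanes; ++i) r[i] = a[i] + b[i];
    return r;
}

// Scalar reference: one element per loop iteration.
void add_scalar(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Restructured form: one kLanes-wide chunk per loop iteration.
// Assumes n is a multiple of kLanes (no tail handling in this sketch).
void add_vectorized(const std::int32_t* a, const std::int32_t* b,
                    std::int32_t* out, std::size_t n) {
    assert(n % kLanes == 0);
    for (std::size_t i = 0; i < n; i += kLanes) {
        const Vec8 va = load8(a + i);
        const Vec8 vb = load8(b + i);
        store8(out + i, add8(va, vb));
    }
}
```

Calling `add_scalar` and `add_vectorized` on the same inputs gives identical results. In a real kernel, the vector type and helper calls would be replaced by the corresponding AI Engine API data types and functions described in the following chapters.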

The following chapters introduce the data types supported by the AI Engine-ML kernel and the registers available to it. They also describe the vector API functions that initialize, load, store, and operate on the vector registers using the appropriate data types.

To achieve the highest performance on the AI Engine-ML, the primary goal of single kernel programming is to keep the utilization of the vector processor close to its theoretical maximum. Vectorizing the algorithm is important, but managing the vector registers, memory accesses, and software pipelining is also required. Because the vector processor can perform an operation every clock cycle, the programmer must strive to make the data for the next operation available while the current operation is executing. Loop optimizations based on software pipelining are available through pragmas. For instance, when the inner loop has sequential or loop-carried dependencies, it might be possible to unroll an outer loop and compute multiple values in parallel. The following sections cover these concepts as well.
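As a minimal sketch of the outer-loop unrolling idea, consider a row-wise recurrence in which each inner iteration needs the result of the previous one. The example is plain portable C++ with no compiler pragmas. The function names, the `int32_t` element type, and the unroll factor of 4 are assumptions chosen for illustration. The inner dependency chain cannot be shortened, but rows are independent of each other, so unrolling the outer loop gives the scheduler four independent chains to interleave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference form. The inner loop carries a dependency through acc:
// each iteration must wait for the previous one to finish.
std::vector<std::int32_t> recur_reference(const std::vector<std::int32_t>& x,
                                          std::size_t rows, std::size_t cols,
                                          std::int32_t k) {
    std::vector<std::int32_t> y(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        std::int32_t acc = 0;
        for (std::size_t c = 0; c < cols; ++c) {
            acc = acc * k + x[r * cols + c];  // loop-carried dependency
        }
        y[r] = acc;
    }
    return y;
}

// Outer loop unrolled by 4. The four accumulators do not depend on each
// other, so their updates can overlap within one inner iteration.
// Assumes rows is a multiple of 4 (no remainder handling in this sketch).
std::vector<std::int32_t> recur_unrolled4(const std::vector<std::int32_t>& x,
                                          std::size_t rows, std::size_t cols,
                                          std::int32_t k) {
    assert(rows % 4 == 0);
    std::vector<std::int32_t> y(rows);
    for (std::size_t r = 0; r < rows; r += 4) {
        const std::int32_t* p0 = x.data() + (r + 0) * cols;
        const std::int32_t* p1 = x.data() + (r + 1) * cols;
        const std::int32_t* p2 = x.data() + (r + 2) * cols;
        const std::int32_t* p3 = x.data() + (r + 3) * cols;
        std::int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        for (std::size_t c = 0; c < cols; ++c) {
            a0 = a0 * k + p0[c];  // chain for row r
            a1 = a1 * k + p1[c];  // chain for row r + 1
            a2 = a2 * k + p2[c];  // chain for row r + 2
            a3 = a3 * k + p3[c];  // chain for row r + 3
        }
        y[r + 0] = a0;
        y[r + 1] = a1;
        y[r + 2] = a2;
        y[r + 3] = a3;
    }
    return y;
}
```

Both functions return the same values. The unrolled form only changes how much independent work is visible inside the inner loop body, which is what allows the data for the next operation to be ready while the current one executes.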