
AI Engine System Software Driver Reference Manual (UG1642)

Document ID
UG1642
Release Date
2024-05-30
Version
2024.1 English

The AMD Versal™ AI Core series introduces a groundbreaking AI Engine technology that sets new standards for AI inference acceleration. These AI Engines are engineered to deliver compute performance exceeding current server-class CPUs by more than 100 times. They are versatile and aptly designed to cater to a wide range of applications, including dynamic cloud workloads and high-bandwidth network operations, all while prioritizing advanced safety and security features.

The AI Engine technology at the core of the Versal AI Core series comprises an array of very-long instruction word (VLIW) processors coupled with single instruction multiple data (SIMD) vector units. This architectural innovation is finely tuned for compute-intensive applications, with a specific focus on digital signal processing (DSP), 5G wireless applications, and artificial intelligence (AI) technologies like machine learning (ML).

A distinguishing feature of AI Engines is their capacity for parallelism. They exhibit multiple levels of parallelism, encompassing both instruction-level and data-level parallelism. Instruction-level parallelism allows for a variety of operations in each clock cycle, including scalar operations, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction, resulting in a 7-way VLIW instruction per clock cycle. Data-level parallelism enables simultaneous processing of multiple sets of data on a per-clock-cycle basis.

Furthermore, each AI Engine includes a vector and scalar processor, dedicated program memory, local 32 KB data memory, and access to local memory in three neighboring directions. These engines are also equipped with DMA engines and AXI4 interconnect switches to facilitate seamless communication via streams to other AI Engines or to the programmable logic (PL) or the DMA. See Versal Adaptive SoC AI Engine Architecture Manual (AM009) for more detailed technical insights.

In addition to the AI Engines, the AI Engine-ML (AIE-ML) block is a noteworthy inclusion in the series. It offers twice the compute throughput compared to its predecessor AI Engine blocks, with a primary focus on machine learning inference applications. The AI Engine-ML block stands out for its industry-leading performance per watt, catering to a wide spectrum of inference applications.

In conclusion, the Versal AI Core series, driven by the remarkable AI Engines and the advanced AIE-ML block, represents a significant leap forward in AI inference acceleration. With their unparalleled compute performance and broad application support, these AI Engines are poised to reshape the landscape for AI, data science, software development, and hardware innovation. Moreover, their ability to optimize wireless applications, such as radio, 5G, backhaul, and high-performance DSP applications, positions them as indispensable assets in today's rapidly evolving technology ecosystem.

Figure 1. AIE-ML Tile Block Diagram

Figure 2. Top Level AI Engine to AIE-ML Array Configuration

Figure 3. AI Engine Driver

The AI Engine software driver enables the setup of AI Engine blocks to create highly adaptable data graphs. The AI Engine system software stack offers the capabilities described in the sections that follow.

As depicted in the previous diagram, the AI Engine system software driver (often referred to as the AIE SSW Driver) assumes a distinct role in contrast to the adaptive dataflow graph (ADF) and Xilinx Runtime (XRT) programming APIs. While ADF and XRT APIs are designed to streamline the development of AI Engine applications, offering high-level programming interfaces and a suite of tools for creating, optimizing, and deploying algorithms and designs onto AI Engines, the AI Engine system software driver occupies a unique position.

This critical software layer resides between the application code, using ADF and XRT APIs, and the underlying hardware. It takes on essential responsibilities, including the configuration, initialization, and management of AI Engines, and facilitating communication between the APU and AI Engines. Furthermore, it provides the necessary software interfaces to enable seamless interactions between applications and AI Engines.

Device Lookup/Management
The software models the various versions of device types and populates the corresponding AI Engine device instances that applications target. These instances are used by the other APIs described in the following sections.
Tile Management
Every AI Engine tile is equipped with multiple resources including a vector processor core, lock, switch, and memory, as depicted previously. The software offers APIs for managing these resources to construct the complete data graph using AI Engine tiles. This encompasses establishing connections and routing between the master and slave ports of each tile and synchronizing operations through tile locks.
DMA Management
Each AI Engine tile features a DMA with dedicated S2MM and MM2S modules for data transfers within the local memory to facilitate stream-based communication with adjacent tiles via stream switches. The DMA controller operates with 32-bit data transfers, orchestrated by a set of buffer descriptors that contain all necessary details for each transaction. These descriptors, accessible through a memory-mapped AXI4 interconnect, enable seamless data flow and synchronization within the AI Engine array, supporting continuous DMA operations across tiles.
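The role of buffer descriptors can be illustrated with a small, self-contained model in C. This is only a sketch, not the driver's data structure or the hardware register layout: `BdModel`, its fields, `BD_NONE`, and `bd_chain_bytes` are hypothetical names introduced here. It assumes that each descriptor carries an address, a length counted in 32-bit words, and a link to the next descriptor, which is how chained descriptors can keep a DMA channel running continuously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BD_NONE (-1) /* hypothetical marker: no next descriptor */

/* Hypothetical model of a buffer descriptor; not the hardware layout. */
typedef struct {
    uint32_t addr;      /* byte offset into local data memory */
    uint32_t len_words; /* transfer length in 32-bit words */
    int      next_bd;   /* index of the next descriptor, or BD_NONE */
} BdModel;

/*
 * Walk a descriptor chain starting at `start` and return the total number
 * of bytes it moves. The walk stops after `num_bds` hops so that a
 * circular (ping-pong) chain is counted once instead of looping forever.
 */
static uint64_t bd_chain_bytes(const BdModel *bds, size_t num_bds, int start)
{
    uint64_t total = 0;
    int cur = start;

    assert(bds != NULL);
    for (size_t hops = 0; hops < num_bds && cur != BD_NONE; hops++) {
        assert(cur >= 0 && (size_t)cur < num_bds);
        total += (uint64_t)bds[cur].len_words * sizeof(uint32_t);
        cur = bds[cur].next_bd;
    }
    return total;
}
```

In this model, a chain of three descriptors with 64, 32, and 16 words moves 448 bytes in total. The actual driver exposes its own descriptor type and DMA APIs, which should be used instead of this model in real applications.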
Event Handling
The AI Engine modules have the capability to produce and disseminate events to other modules. The AI Engine driver facilitates the assignment, generation, and dissemination of events to neighboring tiles through the event APIs. These events encompass errors. Events can trigger notification signals, including broadcast events. These broadcast events serve multiple purposes, such as initiating AI Engine NPI interrupts or routing them to the PL.

The management of errors and events within the AI Engine driver involves configuring broadcast events to trigger NPI interrupts for errors and for specific events that applications need interrupt notifications for. Additionally, the AI Engine driver continuously monitors errors and provides APIs for retrieving error information when errors occur.

The AI Engine driver effectively manages the distribution of errors to various applications, ensuring that errors from one application do not affect others.

Additionally, the AI Engine driver offers the capability to abstract errors into groups, establishing a unified error coding system that remains consistent across different device generations.
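One way to picture this abstraction is a lookup table that maps generation-specific raw error events to generation-independent group codes. The sketch below is illustrative only: the generations (`GEN_A`, `GEN_B`), the group names, the raw event numbers, and `classify_error` are all hypothetical and do not correspond to the driver's actual error codes or APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device generations and error groups (not driver names). */
typedef enum { GEN_A, GEN_B } DevGen;
typedef enum {
    ERR_GROUP_UNKNOWN = 0,
    ERR_GROUP_ECC,
    ERR_GROUP_ACCESS,
    ERR_GROUP_STREAM
} ErrGroup;

typedef struct {
    DevGen   gen;       /* device generation the raw event belongs to */
    uint16_t raw_event; /* made-up raw event number for that generation */
    ErrGroup group;     /* generation-independent group code */
} ErrMapEntry;

/* Made-up mapping: the same group can have different raw IDs per generation. */
static const ErrMapEntry err_map[] = {
    { GEN_A, 64, ERR_GROUP_ECC },
    { GEN_A, 58, ERR_GROUP_ACCESS },
    { GEN_A, 56, ERR_GROUP_STREAM },
    { GEN_B, 70, ERR_GROUP_ECC },
    { GEN_B, 61, ERR_GROUP_ACCESS },
};

/* Translate a raw event into its group; unknown events map to UNKNOWN. */
static ErrGroup classify_error(DevGen gen, uint16_t raw_event)
{
    assert(gen == GEN_A || gen == GEN_B);
    for (size_t i = 0; i < sizeof(err_map) / sizeof(err_map[0]); i++) {
        if (err_map[i].gen == gen && err_map[i].raw_event == raw_event)
            return err_map[i].group;
    }
    return ERR_GROUP_UNKNOWN;
}
```

With a table like this, application code can handle `ERR_GROUP_ECC` the same way regardless of which generation reported it, which is the property the unified error coding is meant to provide.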

Debugging and Trace

The AI Engine has the capability to generate traces in response to specific events, offering various modes for this purpose. The AI Engine driver's trace API enables the configuration of trace controls, and the resulting trace can be directed to memory through stream configuration.

Furthermore, AI Engine tiles come equipped with debug configuration registers that enable you to establish breakpoints, watchpoints, and enable single-stepping. They also feature status registers to monitor the AI Engine core and streams. AI Engine drivers offer APIs for configuring these registers and retrieving their status information.

Power Management

The AI Engine uses clock gating to reduce power consumption. The AI Engine driver is responsible for managing the clocks of AI Engine tiles. When an AI Engine data flow graph is loaded, the driver enables the tiles that the graph needs and keeps the clocks of unused tiles gated. At initial power-on, the platform loader and manager (PLM) automatically clock gates all AI Engine tiles.

Upon the completion of an application's life cycle, the tiles used by the application are reset, and their clocks are gated. Functionality within the AI Engine System Software driver operates as follows:

  1. The driver enforces clock gating on all unused tile columns within the AI Engine array.
    • For each column, the driver designates the uppermost requested tile in that column.
    • It gates all tiles situated above that uppermost tile in the column while keeping all tiles beneath that tile clock-enabled.
    • If no tile is requested in a column, the column's clock buffer in the AI Engine array interface tile of that column is deactivated. Consequently, all tiles above the AI Engine interface tile in that column are subjected to clock gating.
  2. The driver accepts a list of AI Engine locations in terms of row and column.
    • Any tile falling outside the defined range (greater than NumRows or equal to or exceeding NumCols) is not accepted, and the driver reports an error for locations outside the designated range.
  3. The driver takes NumTiles as a 32-bit unsigned integer.
    • The driver validates that NumTiles is less than or equal to the product of NumCols and (NumRows + 1). For example, for an array with 50 columns and 8 AI Engine rows, the limit is 50 × (8 + 1) = 450.
    • The inclusion of NumRows + 1 signifies the inclusion of all AI Engine rows and AI Engine interface tile rows.
All tiles located below the uppermost tile in use for each column remain enabled for error handling and event routing purposes. For example:
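/* Request six tiles. DevInst is assumed to be an initialized device instance. */
u32 NumTiles = 6;
XAie_LocType Loc[6];
Loc[0].Col = 0;
Loc[0].Row = 3;
Loc[1].Col = 2;
Loc[1].Row = 4;
Loc[2].Col = 3;
Loc[2].Row = 1;
/* Column 4 is requested twice; row 5 is its uppermost requested tile. */
Loc[3].Col = 4;
Loc[3].Row = 3;
Loc[4].Col = 4;
Loc[4].Row = 5;
Loc[5].Col = 5;
Loc[5].Row = 6;
/* Check the returned status before using the requested tiles. */
AieRC status = XAie_PmRequestTiles(&DevInst, Loc, NumTiles);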
Figure 4. AI Engine Clock Gating
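The validation and gating rules listed above can be sketched as a small standalone model. This is not the driver's implementation: `TileLoc`, `plan_clock_gating`, `tile_clock_enabled`, and `COL_UNUSED` are hypothetical names, and the array size (50 columns, 8 AI Engine rows, with row 0 as the interface tile row) is taken from the 50 × (8 + 1) example. The sketch only models AI Engine rows 1 through 8 and leaves the state of the interface tile row itself out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COLS   50u  /* columns, from the 50 x (8 + 1) example */
#define NUM_ROWS   8u   /* AI Engine rows; row 0 is the interface tile row */
#define COL_UNUSED (-1) /* hypothetical marker: no tile requested in column */

/* Hypothetical stand-in for the driver's tile location type. */
typedef struct {
    uint8_t Row;
    uint8_t Col;
} TileLoc;

/*
 * Validate a tile request and record, per column, the uppermost requested
 * row (COL_UNUSED if nothing in the column was requested).
 * Returns 0 on success, -1 if the request is rejected.
 */
static int plan_clock_gating(const TileLoc *loc, uint32_t num_tiles,
                             int top_row[NUM_COLS])
{
    assert(loc != NULL && top_row != NULL);

    /* NumTiles must not exceed NumCols * (NumRows + 1). */
    if (num_tiles > NUM_COLS * (NUM_ROWS + 1u))
        return -1;

    for (uint32_t c = 0; c < NUM_COLS; c++)
        top_row[c] = COL_UNUSED;

    for (uint32_t i = 0; i < num_tiles; i++) {
        /* Reject locations outside the array. */
        if (loc[i].Row > NUM_ROWS || loc[i].Col >= NUM_COLS)
            return -1;
        if ((int)loc[i].Row > top_row[loc[i].Col])
            top_row[loc[i].Col] = (int)loc[i].Row;
    }
    return 0;
}

/*
 * For AI Engine rows 1..NUM_ROWS: a tile stays clocked if it is at or
 * below the uppermost requested tile of its column; otherwise it is gated.
 */
static int tile_clock_enabled(const int top_row[NUM_COLS],
                              uint8_t col, uint8_t row)
{
    assert(col < NUM_COLS && row >= 1 && row <= NUM_ROWS);
    return top_row[col] >= (int)row;
}
```

Applied to the six locations in the example, this model keeps rows 1 to 5 of column 4 clocked and gates rows 6 to 8, while column 1, which has no requested tile, is gated entirely.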

Memory Management
The AI Engine array interface provides the necessary functionality to interface with the rest of the device. The AI Engine array interface has three types of AI Engine interface tiles. There is a one-to-one correspondence of interface tiles for every column of the AI Engine array. The interface tiles form a row and move memory-mapped AXI4 and AXI4-Stream data horizontally (left and right) and also vertically up an AI Engine tile column. The AI Engine interface tiles are based on a modular architecture, but the final composition is device specific. During runtime, the AI Engine system software driver has the capability to employ either SMMU or a Linux kernel driver to enforce limitations on the AI Engine partition, preventing it from accessing memory that has not been allocated to it. For a more detailed analysis of solutions involving SMMU and non-coherent DMA on Linux, see Component ID: AIE_KERNEL_DRIVER.

The AI Engine driver incorporates APIs that enable the applications to allocate shared memory for use by AI Engine interface tiles DMA.

Figure 5. Connecting Interrupts from the AI Engine Array to Other Functional Blocks

Error-Correction Code (ECC) Scrubbing
There are three types of performance counters: runtime event performance counters for the AI Engine modules, runtime memory counters for memory modules, and runtime interface counters for AI Engine interface tiles. These counters can be configured to track a variety of events in the AI Engine, the memory module, and the interface tile, and each counter counts occurrences of a given event in a profile configuration. Features such as ECC scrubbing, event trace, and profiling can use these performance counters. The profile feature offers several configurations of the counters that can be applied dynamically at runtime to collect profiling statistics.

ECC scrubbing is ON by default and can be turned ON or OFF using an AI Engine compiler option. For more information, see AI Engine Compiler Options in the AI Engine Tools and Flows User Guide (UG1076). When ECC scrubbing is enabled, three counters are available for profiling. When ECC scrubbing, event trace, and profiling are all used in the same execution, the allocated performance counters might not meet the requirements of all the requested features at the same time. The warning messages shown in Figure 6 indicate this situation.
  • Key Points:
    • The default state of ECC in the AI Engine driver is set to ON. This means that if the XAie_TurnECCOff API is not invoked during compilation, the ECC is enabled by default in the driver.
    • When the XAie_TurnECCOff API is called during the configuration data object (CDO) process, the ECC is disabled.
    • With ECC in the ON state, the AI Engine driver uses performance counter 0 of the core module for all tiles where the executable and linkable format (ELF) is loaded. This counter triggers an event that enables ECC every 10^6 cycles. When generated, this event activates ECC for the program memory (core module) of a tile, and the same event is broadcast from the core module to the memory module for data memory.
    • ECC is exclusively enabled for tiles where the ELF is loaded.
    • When ECC is turned ON, it applies to both program memory and data memory, activating ECC for both modules within a tile based on the presence of loaded ELF files.
Figure 6. Warning Message
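The periodic trigger described in the key points can be modeled as a counter that fires an event each time it reaches a threshold and then restarts. The following is a simplified sketch rather than the hardware or driver behavior: `ScrubCounter`, `scrub_counter_init`, and `scrub_counter_tick` are hypothetical names, and the scrub period is left as a parameter instead of being hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a performance counter used as a periodic trigger. */
typedef struct {
    uint32_t count;     /* cycles counted since the last scrub event */
    uint32_t threshold; /* scrub period in cycles */
    uint64_t events;    /* scrub events generated so far */
} ScrubCounter;

/* Set up the counter with a non-zero scrub period. */
static void scrub_counter_init(ScrubCounter *c, uint32_t period)
{
    assert(c != NULL && period > 0);
    c->count = 0;
    c->threshold = period;
    c->events = 0;
}

/*
 * Advance the model by one clock cycle. Returns 1 on the cycle where the
 * counter reaches its threshold (the scrub event fires and the counter
 * restarts), and 0 otherwise.
 */
static int scrub_counter_tick(ScrubCounter *c)
{
    assert(c != NULL);
    if (++c->count == c->threshold) {
        c->count = 0;
        c->events++;
        return 1;
    }
    return 0;
}
```

In the model, each returned 1 stands for the event that activates ECC for program memory and is then broadcast to the memory module. Because the counter is dedicated to this purpose while ECC is ON, it is not available to trace or profiling in the same tile.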

Reset
  • The AI Engine incorporates control registers that allow you to perform resets at different levels, including the entire AI Engine array, the AI Engine interface tiles, and individual columns.
    Important: An AI Engine reset does not erase data stored in the AI Engine's data or program memory. The reset of the AI Engine array can only be initiated by the PLM.
  • The AI Engine driver offers APIs to reset specific AI Engine partitions. In the event of a power-on reset, the boot PDI procedure requests the PLM to load an AI Engine zeroization binary onto the AI Engine cores, effectively resetting the data memory to a zeroized state. The AI Engine zeroization ELF is embedded within the PLM power domain initialization code. Moreover, the AI Engine driver includes a zeroization function that clears both data and program memory, especially when ECC scrubbing is enabled. This is necessary because ECC scrubbing can detect errors if memory has been gated (inactive) for an extended period of time.

The AI Engine graph configuration can be done in two ways:

  • PDI
  • Application calling the AI Engine driver APIs

The application can select either flow depending on its needs. AMD recommends the PDI flow. The AI Engine graph configuration flow is described in AI Engine Configuration and Termination Flow.

After the graph has been initialized, the application can use the AI Engine driver at runtime for AI Engine monitoring and control. It is also possible to configure the AI Engine from the PL through control packets or AXI4 access. In this case, if there is no handshake between the PL master and the PS master, their accesses can conflict. For this reason, configuring the same AI Engine partition from multiple masters is not supported.