Overview - 2024.1 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2024-06-27
Version
2024.1 English

AMD Versal™ adaptive system-on-chips (SoCs) combine Scalar Engines, Adaptable Engines, and Intelligent Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. Most importantly, Versal adaptive SoC hardware and software are targeted for programming and optimization by data scientists and by software and hardware developers. Versal adaptive SoCs are supported by a host of tools, software, libraries, IP, middleware, and frameworks that enable all industry-standard design flows.

Built on the TSMC 7 nm FinFET process technology, the Versal portfolio is the first platform to combine software programmability and domain-specific hardware acceleration with the adaptability necessary to meet today's rapid pace of innovation. The portfolio includes six families of devices uniquely architected to deliver scalability and AI inference capabilities for a host of applications across different markets, from cloud to networking to wireless communications to edge computing and endpoints.

The Versal architecture combines different engine types with a wealth of connectivity and communication capability and a network on chip (NoC) to enable seamless memory-mapped access to the full height and width of the device. Intelligent Engines are SIMD VLIW AI Engines for adaptive inference and advanced signal processing, and DSP Engines for fixed point, floating point, and complex MAC operations. Adaptable Engines are a combination of programmable logic (PL) blocks and memory, architected for high-compute density. Scalar Engines, including Arm® Cortex®-A72 and Cortex-R5F processors, allow for intensive compute tasks.

AI Engines

The Versal AI Core Series delivers breakthrough AI inference acceleration with AI Engines. This series is designed for a breadth of applications, including cloud for dynamic workloads and network for massive bandwidth, all while delivering advanced safety and security features. AI and data scientists, as well as software and hardware developers, can all take advantage of the high compute density to accelerate the performance of any application. Given the AI Engine's advanced signal processing compute capability, it is well-suited for highly optimized wireless applications such as radio, 5G, backhaul, and other high-performance DSP applications.

AI Engines are an array of very-long instruction word (VLIW) processors with single instruction multiple data (SIMD) vector units that are highly optimized for compute-intensive applications, specifically digital signal processing (DSP), 5G wireless applications, and artificial intelligence (AI) technology such as machine learning (ML).

AI Engines are hardened blocks that provide multiple levels of parallelism, including instruction-level and data-level parallelism. Instruction-level parallelism means that a scalar operation, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction can be issued together, for a total of a 7-way VLIW instruction per clock cycle. Data-level parallelism is achieved via vector-level operations, where multiple sets of data can be operated on in each clock cycle. Each AI Engine contains both a vector and a scalar processor, dedicated program memory, and local data memory, and can access adjacent local memory in any of three neighboring directions. It also has access to DMA engines and AXI4 interconnect switches to communicate via streams with other AI Engines, the programmable logic (PL), or the DMA. Refer to the Versal Adaptive SoC AI Engine Architecture Manual (AM009) for specific details on the AI Engine array and interfaces.
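
The slot arithmetic and the idea of lane-wise vector operations described above can be sketched in plain C++. This is a conceptual model only: it does not use the AI Engine API, and the names `VliwSlots`, `total_slots`, and `vector_mac`, as well as any lane count you pick, are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>

// Issue slots named in the text for a single AI Engine VLIW instruction.
// (Hypothetical struct, used only to make the arithmetic explicit.)
struct VliwSlots {
    int scalar_ops    = 1;  // one scalar operation
    int moves         = 2;  // up to two moves
    int vector_loads  = 2;  // two vector reads (loads)
    int vector_stores = 1;  // one vector write (store)
    int vector_ops    = 1;  // one vector instruction
};

// Sum of all slots that can issue in the same clock cycle.
constexpr int total_slots(const VliwSlots& s) {
    return s.scalar_ops + s.moves + s.vector_loads + s.vector_stores + s.vector_ops;
}

static_assert(total_slots(VliwSlots{}) == 7, "1 + 2 + 2 + 1 + 1 = 7-way VLIW");

// Data-level parallelism, modeled in software: one "vector instruction"
// applies the same multiply-accumulate to every lane. On real hardware the
// lanes are processed together; this loop only imitates the result.
template <std::size_t Lanes>
std::array<int, Lanes> vector_mac(const std::array<int, Lanes>& acc,
                                  const std::array<int, Lanes>& a,
                                  const std::array<int, Lanes>& b) {
    std::array<int, Lanes> out{};
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        out[lane] = acc[lane] + a[lane] * b[lane];
    }
    return out;
}
```

Calling `vector_mac<4>` on three four-element arrays returns the per-lane result of `acc + a * b`. The actual lane count depends on the data type and device; see AM009 for the real figures.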

The AI Engine-ML (AIE-ML) block is capable of delivering 2x compute throughput compared to its predecessor AI Engine blocks. The AIE-ML block, primarily targeted for machine learning inference applications, delivers performance per watt that is among the best in the industry for a wide range of inference applications. Refer to the Versal Adaptive SoC AIE-ML Architecture Manual (AM020) for specific details on the AIE-ML features and architecture.

As an application developer, you can use either a white box flow or a black box flow to run an ML inference application on AIE-ML. With the white box flow, you can integrate custom kernels and dataflow graphs in the AIE-ML programming environment. The black box flow uses performance-optimized Neural Processing Unit (NPU) IP from AMD to accelerate ML workloads in the AIE-ML block.

AMD Vitis™ AI is used as a front-end tool that parses the network graph, optimizes and quantizes the graph, and generates compiled code that can be run on the AIE-ML hardware. The AIE-ML core tile architecture supports fixed-point and floating-point data types at a variety of precisions. The architecture allows for pipelined vector processing and incorporates high-density, high-speed on-chip memory that can effectively store tensors on chip. Additionally, it features versatile data movers that are adept at handling multi-dimensional tensors in memory. With the proper selection of overlay processor architecture and spatial and temporal distribution of the input/output tensor in the on/off-chip memory, it is possible to achieve high computational efficiency of the AIE-ML processing cores.

AI Engine Architecture Overview provides a high-level overview of the AI Engine architecture, tools, and documents that can be referenced for kernel programming.

AI Engine Kernels

An AI Engine kernel is a C/C++ program written using specialized APIs that target the VLIW vector processor. The AI Engine compiler, which is included in the AMD Vitis™ core development kit, compiles the kernel code to produce an ELF file that runs on the AI Engine processors.

AI Engine Graphs

An AI Engine program requires a data flow graph specification, which is written in C++. This specification can be compiled and executed using the AI Engine compiler. An adaptive data flow (ADF) graph application consists of nodes and edges, where nodes represent compute kernel functions and edges represent data connections. Kernels in the application are compiled to run on the AI Engines. Refer to AI Engine Kernel and Graph Programming Guide (UG1079) for more information about how to develop, debug, and optimize AI Engine kernels and graphs. It also includes information on specialized graph constructs and ways to control the AI Engine graph.
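
The node-and-edge structure described above can be illustrated with a small standard C++ model. This is a sketch for intuition, not the ADF graph API documented in UG1079: `ToyGraph`, `add_kernel`, `connect`, `run_chain`, and `demo_scale_then_offset` are hypothetical names, and the model only handles a simple chain whose edges are added in producer-to-consumer order.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Buffer   = std::vector<int>;
using KernelFn = std::function<Buffer(const Buffer&)>;

// Toy data flow graph: nodes are compute kernels, edges are data connections.
struct ToyGraph {
    std::vector<std::pair<std::string, KernelFn>> nodes;     // kernels
    std::vector<std::pair<std::size_t, std::size_t>> edges;  // producer -> consumer

    // Registers a kernel and returns its node index.
    std::size_t add_kernel(std::string name, KernelFn fn) {
        nodes.emplace_back(std::move(name), std::move(fn));
        return nodes.size() - 1;
    }

    // Records a data connection from one kernel's output to another's input.
    void connect(std::size_t producer, std::size_t consumer) {
        edges.emplace_back(producer, consumer);
    }

    // Runs the kernel at `first`, then follows edges in insertion order,
    // feeding each producer's output to its consumer (chains only).
    Buffer run_chain(std::size_t first, const Buffer& input) const {
        Buffer data = nodes.at(first).second(input);
        std::size_t current = first;
        for (const auto& edge : edges) {
            if (edge.first == current) {
                data = nodes.at(edge.second).second(data);
                current = edge.second;
            }
        }
        return data;
    }
};

// Example: a two-kernel chain that doubles each sample, then adds one.
inline Buffer demo_scale_then_offset(const Buffer& input) {
    ToyGraph g;
    const std::size_t scale = g.add_kernel("scale", [](const Buffer& in) {
        Buffer out;
        out.reserve(in.size());
        for (int v : in) out.push_back(v * 2);
        return out;
    });
    const std::size_t offset = g.add_kernel("offset", [](const Buffer& in) {
        Buffer out;
        out.reserve(in.size());
        for (int v : in) out.push_back(v + 1);
        return out;
    });
    g.connect(scale, offset);
    return g.run_chain(scale, input);
}
```

In a real ADF graph the compiler maps kernels to AI Engine tiles and connections to buffers or streams; here everything runs sequentially on the host, which is enough to show how kernels and connections compose.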

Compiling and Simulating the Program

Compiling an AI Engine Graph Application describes in detail the different types of compilation available with the AI Engine compiler, the options and input files that can be passed in, and the expected output. You can compile the graph and kernels independently, or as part of a larger system, and set up the design to capture and profile event trace data at runtime.

Simulating an AI Engine Graph Application describes the AI Engine simulator in detail, as well as the x86 simulator for functional simulation. The AI Engine simulator simulates the graph application as a standalone entity, or as part of the hardware emulation of a larger system design.

Using the Vitis Unified IDE describes migrating, building, running, and debugging the AI Engine component in the Vitis Unified IDE. The Vitis tools embrace a bottom-up design flow that lets you develop components of a system and then integrate the components into a top-level system application. Refer to Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for a description of building the full system project incorporating the different components described above.

Integrating the AI Engine Application into a Versal Design Using Vitis

The AI Engine kernels and graph developed in the previous steps can be integrated into a larger Versal adaptive SoC system design that can consist of AI Engine kernels, HLS PL kernels, RTL kernels, and the host application. The Vitis compiler builds this larger system. Refer to Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for a description of building the full system project incorporating the different components described above.

Profiling and Debugging Designs with AI Engine

Performance Analysis of AI Engine Graph Application during Simulation describes how to extract performance data by performing event tracing when running the hardware emulation build or the hardware build. This data can be used to further optimize the AI Engine kernels and graphs.

Performance Analysis of AI Engine Graph Application on Hardware describes how to profile and extract performance data by performing event tracing when running the design in hardware. Simulating and Debugging the AI Engine Component provides information related to debugging the AI Engine component from within the Vitis Unified IDE.

Debugging System Projects in Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) shows you how to run and use the debug environment from the command line or from the Vitis Unified IDE. Evaluating system performance and debugging the application are key steps to achieving your objectives for the system.

Controlling the AI Engine Graph

Programming the PS Host Application describes the process of creating a host application to control the graph and PL kernels of the system. When your design is deployed in hardware, you can install drivers that facilitate initializing and controlling the graph execution via a host application running on the PS, or load and run the AI Engine graph at device boot time.

Application-specific AI Engine control code is generated by the AI Engine compiler as part of compiling the AI Engine design graph and kernel code. The AI Engine control code can:

  • Control the initial loading of the AI Engine kernels.
  • Run the graph for several iterations, update the runtime parameters (RTP) associated with the graph, exit, and reset the AI Engines.
    Note: A graph can have multiple kernels and multiple input and output ports. The graph connectivity, which is equivalent to the nets in a data flow graph, is either between kernels, between a kernel and input ports, or between a kernel and output ports, and can be configured as a connection. A graph runs for an iteration when it consumes data samples equal to the window or stream of data expected by the kernels in the graph, and produces data samples equal to the window or stream of data expected at the output of all the kernels in the graph.
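
The notion of an iteration and of a runtime parameter update between runs can be modeled with a small standard C++ class. This is an illustrative sketch and not the generated control code or the ADF API: `ToyGainGraph` and its methods are hypothetical, the single "kernel" just multiplies each sample by a gain that stands in for an RTP, and the model throws on missing input where real hardware would wait for data.

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <vector>

// Toy single-kernel graph: one iteration consumes exactly one window of
// input samples and produces one window of output samples.
class ToyGainGraph {
public:
    ToyGainGraph(std::size_t window_size, int gain)
        : window_size_(window_size), gain_(gain) {
        if (window_size_ == 0) {
            throw std::invalid_argument("window size must be positive");
        }
    }

    // Queues samples on the graph's input port.
    void push_input(const std::vector<int>& samples) {
        input_.insert(input_.end(), samples.begin(), samples.end());
    }

    // Stands in for an RTP update between runs.
    void update_gain(int gain) { gain_ = gain; }

    // Runs the graph for the requested number of iterations.
    void run(std::size_t iterations) {
        for (std::size_t it = 0; it < iterations; ++it) {
            if (input_.size() < window_size_) {
                throw std::runtime_error("not enough input for a full iteration");
            }
            for (std::size_t i = 0; i < window_size_; ++i) {
                output_.push_back(input_.front() * gain_);
                input_.pop_front();
            }
            ++iterations_done_;
        }
    }

    const std::vector<int>& output() const { return output_; }
    std::size_t iterations_done() const { return iterations_done_; }

private:
    std::size_t window_size_;
    int gain_;
    std::size_t iterations_done_ = 0;
    std::deque<int> input_;
    std::vector<int> output_;
};
```

With a window size of 4, pushing eight samples, calling `run(1)`, updating the gain, and calling `run(1)` again yields two output windows, each scaled by the gain in effect when its iteration ran.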

The Vitis core development kit provides platforms for building, simulating, debugging, and deploying your AI Engine designs. Each platform targets a specific hardware board, for example the VCK190 board or the VEK280 board. A platform enables development of a design that includes AI Engine and PL kernels with a host application that targets the Linux OS running on the Arm processor in the PS. Designs developed on such a platform can be verified using the hardware emulation flow and run on the target hardware board.

AI Engine Methodology

Mapper/Router Methodology describes the mapper and router methodology to be used when handling a failure in the AI Engine compiler during the mapper and/or router phase.

AI Engine Hardware Profile and Debug Methodology describes the five-stage profile and debug methodology to use when you run a Versal design with AI Engine graphs in hardware.

Vitis Unified Integrated Design Environment

Important: This document has been updated to reflect the use of the Vitis Unified IDE and the use of the v++ common command line syntax to create AI Engine components. The classic Vitis IDE has been deprecated and will be discontinued in a future release. You can refer to the 2023.1 version of this document for information on using the classic Vitis IDE for creating and debugging an AI Engine graph application or system design. See Migrating Vitis Classic IDE Graph Applications to Vitis Unified IDE for information on migrating AI Engine graph applications from the classic IDE to the Vitis Unified IDE.

The next-generation Vitis Unified IDE provides system project design and debug for heterogeneous computing systems, embedded system design, and data center acceleration. Elements of the system project include AI Engine and High-Level Synthesis (HLS) component creation, platform creation, and embedded software design.

The Vitis Unified IDE uses a common command line to compile and run the elements of the design. See Compile using v++ (Unified Compiler) for a review of the v++ command-line flows for developing AI Engine components.