Overview - 2023.1 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2023-06-23
Version: 2023.1 English

AMD Versal™ adaptive system-on-chips (SoCs) combine Scalar Engines, Adaptable Engines, and Intelligent Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. Most importantly, Versal adaptive SoC hardware and software are targeted for programming and optimization by data scientists and by software and hardware developers. Versal adaptive SoCs are supported by a host of tools, software, libraries, IP, middleware, and frameworks that enable all industry-standard design flows.

Built on the TSMC 7 nm FinFET process technology, the Versal portfolio is the first platform to combine software programmability and domain-specific hardware acceleration with the adaptability necessary to meet today's rapid pace of innovation. The portfolio includes six series of devices uniquely architected to deliver scalability and AI inference capabilities for a host of applications across different markets, from cloud to networking, wireless communications, edge computing, and endpoints.

The Versal architecture combines different engine types with a wealth of connectivity and communication capability and a network on chip (NoC) to enable seamless memory-mapped access to the full height and width of the device. Intelligent Engines are SIMD VLIW AI Engines for adaptive inference and advanced signal processing compute, and DSP Engines for fixed point, floating point, and complex MAC operations. Adaptable Engines are a combination of programmable logic blocks and memory, architected for high-compute density. Scalar Engines, including Arm® Cortex®-A72 and Cortex-R5F processors, allow for intensive compute tasks.

AI Engines

The Versal AI Core series delivers breakthrough AI inference acceleration with AI Engines that deliver over 100x greater compute performance than current server-class CPUs. This series is designed for a breadth of applications, including cloud for dynamic workloads and network for massive bandwidth, all while delivering advanced safety and security features. AI and data scientists, as well as software and hardware developers, can all take advantage of the high compute density to accelerate the performance of any application. Given the AI Engine's advanced signal processing compute capability, it is well-suited for highly optimized wireless applications such as radio, 5G, backhaul, and other high-performance DSP applications.

AI Engines are an array of very-long instruction word (VLIW) processors with single instruction multiple data (SIMD) vector units that are highly optimized for compute-intensive applications, specifically digital signal processing (DSP), 5G wireless applications, and artificial intelligence (AI) technology such as machine learning (ML).

AI Engines are hardened blocks that provide multiple levels of parallelism, including instruction-level and data-level parallelism. Instruction-level parallelism comes from issuing, in a single clock cycle, a scalar operation, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction: in total, a 7-way VLIW instruction. Data-level parallelism is achieved via vector-level operations in which multiple sets of data are operated on in each clock cycle. Each AI Engine contains both a vector and a scalar processor, dedicated program memory, 32 KB of local data memory, and access to local memory in any of three neighboring directions. It also has access to DMA engines and AXI4 interconnect switches to communicate via streams to other AI Engines or to the programmable logic (PL) or the DMA. Refer to the Versal Adaptive SoC AI Engine Architecture Manual (AM009) for specific details on the AI Engine array and interfaces.
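
The slot arithmetic behind the 7-way figure can be checked with a small plain C++ sketch. It is illustrative only and does not use any AI Engine API. The names IssueSlot, kIssueSlots, vliw_issue_width, and lanes_per_vector are hypothetical, and the vector and element widths passed to lanes_per_vector are supplied by the caller because this overview does not state them.

```cpp
#include <array>
#include <cstddef>

// One kind of operation that can be issued in a single VLIW instruction.
struct IssueSlot {
    const char* kind;
    int max_per_instruction;
};

// Slot counts taken from the paragraph above.
constexpr std::array<IssueSlot, 5> kIssueSlots{{
    {"scalar operation", 1},
    {"move", 2},
    {"vector load", 2},
    {"vector store", 1},
    {"vector instruction", 1},
}};

// Sum the per-kind maxima to get the total issue width.
constexpr int vliw_issue_width() {
    int total = 0;
    for (std::size_t i = 0; i < kIssueSlots.size(); ++i) {
        total += kIssueSlots[i].max_per_instruction;
    }
    return total;
}

// Hypothetical helper for data-level parallelism: the number of elements one
// vector operation covers, given a vector width and an element width in bits.
// Both widths are assumptions supplied by the caller.
constexpr int lanes_per_vector(int vector_bits, int element_bits) {
    return element_bits > 0 ? vector_bits / element_bits : 0;
}

static_assert(vliw_issue_width() == 7, "slots should add up to a 7-way VLIW");
```

The static_assert makes the compiler confirm that the listed slots add up to seven; lanes_per_vector only shows how lane count scales with element width and is not a statement about the actual register sizes, which are covered in AM009.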

The AI Engine-ML (AIE-ML) block is capable of delivering 2x the compute throughput of its predecessor AI Engine blocks. The AIE-ML block, primarily targeted at machine learning inference applications, delivers performance per watt that is among the industry's best for a wide range of inference applications.

As an application user, you can use either a white box flow or a black box flow to run an ML inference application on AIE-ML. The white box flow uses the libraries element, where you can integrate custom kernels and dataflow graphs in the AIE-ML programming environment. The black box flow uses the performance-optimized Deep learning Processing Unit (DPU) IP from AMD to accelerate ML workloads in the AIE-ML block.

AMD Vitis™ AI is used as a front-end tool that parses the network graph, optimizes and quantizes the graph, and generates a quantized network model that can be accelerated on the AIE-ML hardware. The AIE-ML core tile architecture supports multiple-precision fixed-point and floating-point data types with pipelined vector processing, high-density, high-speed on-chip memory that can be used for storing on-chip tensors, and flexible data movers capable of addressing multi-dimensional tensors in memory. With the proper selection of overlay processor architecture and of the spatial and temporal distribution of the input/output tensors in on-chip and off-chip memory, it is possible to achieve higher computational efficiency of the AIE-ML processing cores.

AI Engine Architecture Overview provides a high-level overview of the AI Engine architecture, tools, and documents that can be referenced for kernel programming.

AI Engine Kernels

An AI Engine kernel is a C/C++ program which is written using specialized APIs that target the VLIW vector processor. The AI Engine kernel code is compiled using the AI Engine compiler (aiecompiler) that is included in the AMD Vitis™ core development kit. The AI Engine compiler compiles the kernels to produce an ELF file that is run on the AI Engine processors.
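
As a rough illustration of the shape of such a kernel, the following plain C++ sketch processes one block of samples in fixed-size groups, with the inner loop standing in for a single vector operation. It is a model only: it does not use the specialized AI Engine APIs, it is not meant to be built with aiecompiler, and the names scale_kernel and kLanes, as well as the lane count of 8, are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical number of elements handled by one vector operation.
constexpr std::size_t kLanes = 8;

// Kernel-style function: multiply every sample of one input block by `gain`
// and write the results to `out`.
void scale_kernel(const std::vector<std::int32_t>& in,
                  std::vector<std::int32_t>& out,
                  std::int32_t gain) {
    // The block must hold a whole number of vectors.
    assert(in.size() % kLanes == 0);
    out.resize(in.size());

    for (std::size_t base = 0; base < in.size(); base += kLanes) {
        // On a SIMD vector unit, this inner loop would be a single
        // vector multiply across kLanes lanes.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[base + lane] = in[base + lane] * gain;
        }
    }
}
```

Calling scale_kernel on a 16-sample block with a gain of 3 fills out with each input sample multiplied by 3. A real kernel would express the inner loop with vector types and intrinsics or APIs described in UG1079.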

AI Engine Graphs

An AI Engine program consists of a data flow graph specification which is written in C++. This specification can be compiled and executed using the AI Engine compiler. An adaptive data flow (ADF) graph application consists of nodes and edges where nodes represent compute kernel functions, and edges represent data connections. Kernels in the application can be compiled to run on the AI Engines or in the PL region of the device. Refer to AI Engine Kernel and Graph Programming Guide (UG1079) for more information about how to develop, debug, and optimize AI Engine kernels and graphs. It also includes information on specialized graph constructs and ways to control the AI Engine graph.
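
The nodes-and-edges structure can be modeled in a few lines of standard C++. This is a conceptual sketch, not the ADF graph API: DataflowGraph, KernelNode, Edge, Target, and make_example_graph are hypothetical names, and the three kernels in the example are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Where a kernel is compiled to run, per the text above.
enum class Target { AIEngine, PL };

// A node: one compute kernel function.
struct KernelNode {
    std::string name;
    Target target;
};

// An edge: one data connection from a producer node to a consumer node.
struct Edge {
    std::size_t from;
    std::size_t to;
};

class DataflowGraph {
public:
    // Adds a kernel and returns its node id.
    std::size_t add_kernel(std::string name, Target target) {
        nodes_.push_back(KernelNode{std::move(name), target});
        return nodes_.size() - 1;
    }

    // Adds a data connection between two existing kernels.
    void connect(std::size_t from, std::size_t to) {
        assert(from < nodes_.size() && to < nodes_.size());
        edges_.push_back(Edge{from, to});
    }

    // Returns the ids of the kernels that consume this kernel's output.
    std::vector<std::size_t> successors(std::size_t node) const {
        std::vector<std::size_t> result;
        for (const Edge& e : edges_) {
            if (e.from == node) result.push_back(e.to);
        }
        return result;
    }

    const KernelNode& kernel(std::size_t id) const { return nodes_.at(id); }
    std::size_t kernel_count() const { return nodes_.size(); }
    std::size_t edge_count() const { return edges_.size(); }

private:
    std::vector<KernelNode> nodes_;
    std::vector<Edge> edges_;
};

// Example: two AI Engine kernels feeding one PL kernel in a chain.
DataflowGraph make_example_graph() {
    DataflowGraph g;
    const std::size_t filter = g.add_kernel("filter", Target::AIEngine);
    const std::size_t gain = g.add_kernel("gain", Target::AIEngine);
    const std::size_t packer = g.add_kernel("packer", Target::PL);
    g.connect(filter, gain);
    g.connect(gain, packer);
    return g;
}
```

In this model, make_example_graph() yields three nodes and two edges, and the per-node target records whether the kernel would run on the AI Engines or in the PL. The real graph specification, including port types and connection constructs, is described in UG1079.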

Controlling the AI Engine Graph

Programming the PS Host Application describes the process of creating a host application to control the graph and PL kernels of the system. When your design is deployed in hardware, you can install drivers that facilitate initializing and controlling the graph execution via a host application running on the PS, or load and run the AI Engine graph at device boot time.

Application-specific AI Engine control code is generated by the AI Engine compiler as part of compiling the AI Engine design graph and kernel code. The AI Engine control code can:

  • Control the initial loading of the AI Engine kernels.
  • Run the graph for several iterations, update the run-time parameters (RTP) associated with the graph, exit and reset the AI Engines.
    Note: A graph can have multiple kernels and multiple input and output ports. The graph connectivity, which is equivalent to the nets in a data flow graph, is either between kernels, between a kernel and input ports, or between a kernel and output ports, and can be configured as a connection. A graph runs for one iteration when it consumes data samples equal to the window or stream of data expected by the kernels in the graph, and produces data samples equal to the window or stream of data expected at the output of all the kernels in the graph.
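
The control sequence and the definition of an iteration can be sketched together in plain C++. This is a mock for illustration, not the control code generated by the AI Engine compiler: MockGraphControl, its member functions, and iterations_for are hypothetical names, and the mock tracks only state, without touching any hardware.

```cpp
#include <cassert>
#include <cstddef>

// Whole graph iterations that a given amount of input data can drive, when
// each iteration consumes `samples_per_iteration` input samples.
constexpr std::size_t iterations_for(std::size_t total_input_samples,
                                     std::size_t samples_per_iteration) {
    return samples_per_iteration == 0
               ? 0
               : total_input_samples / samples_per_iteration;
}

// Mock of the control steps listed above: load, run, update an RTP, exit.
class MockGraphControl {
public:
    // Stands in for the initial loading of the AI Engine kernels.
    void load() { loaded_ = true; }

    // Stands in for running the graph for a number of iterations.
    void run(std::size_t iterations) {
        assert(loaded_);
        completed_ += iterations;
    }

    // Stands in for updating a run-time parameter (RTP) of the graph.
    void update_rtp(int value) {
        assert(loaded_);
        rtp_ = value;
    }

    // Stands in for exiting and resetting the AI Engines.
    void end() { loaded_ = false; }

    bool loaded() const { return loaded_; }
    std::size_t completed_iterations() const { return completed_; }
    int rtp() const { return rtp_; }

private:
    bool loaded_ = false;
    std::size_t completed_ = 0;
    int rtp_ = 0;
};
```

For example, 1024 input samples with 128 samples consumed per iteration give iterations_for(1024, 128) == 8; a host would load the graph, run it for that many iterations, optionally update an RTP and run again, and then end the graph.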

The Vitis core development kit provides the xilinx_vck190_base_202310_1 platform and the xilinx_vck190_base_dfx_202310_1 platform for building, simulating, debugging, and deploying your AI Engine designs, targeting the VCK190 board. It enables development of a design including AI Engine and PL kernels with a host application that targets the Linux OS running on the Arm processor in the PS. Designs developed on this platform can be verified using the hardware emulation flow. These designs can also run on the VCK190 board.

Compiling and Simulating the Program

Compiling an AI Engine Graph Application describes in detail the different types of compilation available with the AI Engine compiler, the options and input files that can be passed in, and the expected output. You can compile the graph and kernels independently, or as part of a larger system, and set up the design to capture and profile event trace data at run time.

Simulating an AI Engine Graph Application describes the AI Engine simulator in detail, as well as the x86 simulator for functional simulation. The AI Engine simulator simulates the graph application as a standalone entity, or as part of the hardware emulation of a larger system design.

Performance Analysis of AI Engine Graph Application during Simulation describes how to extract performance data by performing event tracing when running the hardware emulation build or the hardware build. This data can be used to further optimize the AI Engine kernels and graphs.

Mapper/Router Methodology describes the mapper and router methodology to be used when handling a failure in the AI Engine compiler during the mapper and/or router phase.

Integrating and Deploying the AI Engine Graph as Part of a Versal Adaptive SoC System Design

The AI Engine kernels and graph developed in the previous steps can be used as part of a larger Versal adaptive SoC system design that can consist of AI Engine kernels, HLS PL kernels, RTL kernels, and the host application. The Vitis compiler builds this larger system.

As described in Integrating the Application Using the Vitis Tools Flow, you can use a command-line approach for building the system, or use a GUI-based approach as described in Using the Vitis IDE. Either approach lets you perform simulation or emulation to verify the design, debug the design in an interactive debug environment, and build the design to deploy on hardware.

Integrating the Application Using the Vitis Tools Flow also introduces the Dynamic Function eXchange (DFX) platform deployment flow in Targeting the DFX Platform. The DFX flow allows you to load or reload an xclbin into the DFX region at run time, and thus to reset the AI Engine design and PL kernels.

Profiling and Debugging Designs with AI Engine in Hardware

Performance Analysis of AI Engine Graph Application on Hardware describes how to profile and extract performance data by performing event tracing when running the design in hardware.

Debugging the AI Engine Application shows you how to run and use the debug environment from the command line or from the Vitis IDE. Evaluating system performance and debugging the application are key steps in achieving the application objectives.

AI Engine Hardware Profile and Debug Methodology describes the five-stage profile and debug methodology to use when you run a Versal design with AI Engine graphs in hardware.