Assembly and Simulation Using Hardware Emulation - 2021.1 English

Versal ACAP Design Guide (UG1273)

Document ID: UG1273
Release Date: 2021-06-30
Version: 2021.1 English

In the second step of this design flow, you gradually assemble components (PS, PL, and AI Engine) of the processing system on top of the target platform and use the Vitis hardware emulation flow to simulate the integrated system. Hardware emulation is a cycle-approximate simulation of the system. The AI Engine graph runs in the SystemC simulator (aiesimulator). RTL behavioral models of the PL run in the Vivado simulator or a supported third-party simulator. The software code executing on the PS is simulated using the Xilinx Quick Emulator (QEMU).

The target platform contains all of the necessary hardware and software infrastructure resources required for the project. It is possible to target a standard Xilinx platform or a custom platform for your project. At this step in the flow, Xilinx recommends using a standard and pre-verified platform to reduce uncertainty in the process and focus efforts on the system components (graph and kernels).

The Vitis linker (v++ --link) is used to assemble the compiled AI Engine graph (libadf.a) and PL kernels (.xo) with the targeted platform. The Vitis linker establishes connections between the AI Engine ports, PL kernels, and other platform resources.
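As a rough illustration of the link step, the following sketch composes a `v++ --link` invocation for hardware emulation, together with a small connectivity configuration file. It only builds the text; it does not run the tool. The platform file name, kernel instance names (`mm2s_1`, `s2mm_1`), AI Engine port names (`DataIn1`, `DataOut1`), and output file name are hypothetical placeholders, and the `ai_engine_0` instance name is an assumption about how the platform exposes the AI Engine array.

```python
from typing import List, Sequence, Tuple


def render_connectivity(streams: Sequence[Tuple[str, str]]) -> str:
    """Render a [connectivity] section with one stream_connect line per (source, sink) pair."""
    lines = ["[connectivity]"]
    lines += [f"stream_connect={src}:{dst}" for src, dst in streams]
    return "\n".join(lines) + "\n"


def build_link_command(platform: str, libadf: str, xo_files: Sequence[str],
                       config: str, output: str) -> List[str]:
    """Compose the argument list for linking a compiled graph and PL kernels with a platform."""
    return [
        "v++", "--link",
        "--target", "hw_emu",    # build for hardware emulation
        "--platform", platform,  # target platform (hypothetical name in the example below)
        "--config", config,      # file holding the [connectivity] section
        *xo_files,               # compiled PL kernels (.xo)
        libadf,                  # compiled AI Engine graph (libadf.a)
        "-o", output,            # linked output (name is an assumption)
    ]


if __name__ == "__main__":
    # Hypothetical PL data movers connected to hypothetical AI Engine graph ports.
    print(render_connectivity([
        ("mm2s_1.s", "ai_engine_0.DataIn1"),
        ("ai_engine_0.DataOut1", "s2mm_1.s"),
    ]), end="")
    print(" ".join(build_link_command(
        "my_platform.xpfm", "libadf.a", ["mm2s.xo", "s2mm.xo"],
        "system.cfg", "system.xclbin")))
```

Keeping the stream connections in a configuration file rather than on the command line makes it easier to add or remove connections as components are assembled gradually.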

Because this design flow progresses gradually, certain elements might not exist in early iterations. You might need to terminate unconnected signals, drive signals, or provide sinks. In particular, unterminated streaming connections between the AI Engine graph and PL kernels (PLIOs and AXI4-Stream) must be driven or terminated by traffic generators and test benches during emulation. You can add these during the linking process using v++ options. The Vitis flow supports C++, SystemC, Python, and RTL traffic generators and test benches.

On the Versal ACAP, the clock on the AI Engine array can run at 1 GHz or more, depending on the device speed grade, while the clock in the PL region runs at a different, lower frequency. As a result, the data throughput of the AI Engine kernels and the PL kernels can differ. To match the throughput capacities of the PL and AI Engine regions, the Vitis linker automatically inserts FIFOs on streaming interfaces, as well as clock domain converters (CDC) and data width converters (DWC) between the AI Engine and PL kernels, as needed.
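The throughput mismatch can be made concrete with some simple arithmetic. The sketch below is illustrative only: the 32-bit AI Engine stream width, the candidate PL interface widths (32, 64, and 128 bits), and the PL clock frequencies are assumptions chosen for the example, not values stated in this guide. It computes the raw bandwidth of a stream as clock frequency times data width, then finds the narrowest PL-side width that keeps up with the AI Engine side.

```python
from typing import Optional, Sequence


def stream_bandwidth(clock_hz: int, width_bits: int) -> int:
    """Raw stream bandwidth in bytes per second: one word of width_bits per clock cycle."""
    return clock_hz * width_bits // 8


def min_pl_width_bits(aie_clock_hz: int, aie_width_bits: int, pl_clock_hz: int,
                      allowed_widths: Sequence[int] = (32, 64, 128)) -> Optional[int]:
    """Return the narrowest allowed PL width whose bandwidth matches the AI Engine stream.

    Returns None when even the widest allowed interface cannot keep up.
    """
    target_bits_per_s = aie_clock_hz * aie_width_bits
    for width in sorted(allowed_widths):
        if pl_clock_hz * width >= target_bits_per_s:
            return width
    return None


if __name__ == "__main__":
    aie_clock = 1_000_000_000  # 1 GHz AI Engine clock (from the text above)
    aie_width = 32             # assumed AI Engine stream width in bits
    print("AI Engine stream:", stream_bandwidth(aie_clock, aie_width), "bytes/s")
    for pl_clock in (500_000_000, 250_000_000, 200_000_000):  # assumed PL clocks
        width = min_pl_width_bits(aie_clock, aie_width, pl_clock)
        print(f"PL at {pl_clock // 1_000_000} MHz -> width needed: {width}")
```

Under these assumptions, a PL kernel running at a quarter of the AI Engine clock needs an interface four times as wide to sustain the same bandwidth, which is the kind of gap the clock domain and data width converters bridge.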

The Vitis packager (v++ --package) is used to add the PS program and to generate the required setup to run hardware emulation. The PS program controls the AI Engine graphs and the PL kernels through the Xilinx Runtime (XRT) APIs. XRT is an open-source library that makes it easy to interact with PL kernels and AI Engine graphs from a software application, either embedded or x86-based. The control operations are as follows.

AI Engine graphs
Controlling the AI Engine graph includes operations such as loading the graph, initializing it, updating RTPs and GMIOs, and waiting for and resuming execution. To perform these operations, the PS program must use the graph APIs included in the XRT library.
PL kernels
Controlling the PL kernels includes operations such as reading and writing kernel registers, transferring data to and from memory, and starting the kernel and waiting for its completion. To perform these operations, Xilinx recommends using the PL kernel APIs included in the XRT library. Because PL kernels are controlled through regular register access operations, it is also possible to use userspace I/O (UIO) drivers instead of using the XRT APIs.

In this step, most models are cycle accurate. However, some models are only approximate, and other models are transaction-level models (TLM). PL kernels are simulated using the target clock, which is not guaranteed to be met during implementation. The interactions between the AI Engine graph and PL kernels are modeled at the cycle level, but overall accuracy depends on the accuracy of the patterns produced by the traffic generators and other test bench modules. The impact of complex I/O interactions cannot be accurately modeled. The slower performance of the emulation environment also limits the amount of traffic and the number of test vectors that can be run.

Note: Meeting performance targets in hardware emulation is necessary, but it does not guarantee the results in hardware. Hardware emulation is cycle approximate and gives more accurate performance figures than the first step in the design flow. However, performance results are still not final at this stage.

For more information on how to assemble and simulate the processing system, see the System Simulation section of the Versal ACAP Design Process Documentation: System Integration and Validation.