Hardware emulation uses a mix of SystemC and RTL co-simulation to balance simulation accuracy against simulation speed. The SystemC models are a mix of purely functional models and performance-approximate models. Hardware emulation is not 100% accurate to hardware, so you should expect some differences in behavior between running emulation and executing your application on hardware. Application performance can differ significantly, and functional differences can sometimes be observed as well.
Functional differences from hardware typically point to a race condition or some other unpredictable behavior in your design. As a result, an issue seen in hardware might not always be reproducible in hardware emulation. However, most behavior related to interactions between the host and the accelerator, or between the accelerator and memory, is reproducible in hardware emulation. This makes hardware emulation an excellent tool for debugging issues with your accelerator before running on hardware.
The following table lists models that are used to mimic the hardware platform and their accuracy levels.
Hardware Functionality | Description |
---|---|
Host to Card PCIe® Connection and DMA (XDMA, SlaveBridge) | For data center platforms, the connection to the x86 host server over PCIe is done as a purely functional model and does not have any performance modeling. Thus, any issues related to PCIe bandwidth cannot be reflected in hardware emulation runs. |
AMD UltraScale™ DDR Memory, SmartConnect | The SystemC models for the DDR memory controller, AXI SmartConnect, and other data path IPs are usually throughput approximate. They typically do not model the exact latency of the hardware IP. The model can be used to gauge a relative performance trend as you modify your application or the accelerator kernel. |
AI Engine | The AI Engine SystemC model is cycle approximate, though it is not intended to be 100% cycle accurate. A common model is used between AI Engine Simulator and hardware emulation, thus enabling a reasonable comparison between the two stages. |
AMD Versal™ NoC and DDR Models | The Versal NoC and DDR SystemC models are cycle approximate. |
Arm Processing Subsystem (PS, CIPS) | The Arm PS is modeled using QEMU, which is a purely functional execution model. For more information, see QEMU. |
User Kernel (accelerator) | Hardware emulation uses RTL for the user accelerator. As a result, the accelerator behavior by itself is 100% accurate. However, the accelerator is surrounded by other approximate models. |
Other I/O Models | For hardware emulation, there is a generic Python or C-based traffic generator that can be interfaced with the emulation process. You can generate abstract traffic at the AXI protocol level that mimics the I/O in your design. Because these models are abstract, issues that are specific to the hardware board will not appear in hardware emulation. |
Because hardware emulation uses RTL co-simulation as its execution model, execution is orders of magnitude slower than on real hardware. AMD recommends using small data buffers. For example, if you have a configurable vector addition and perform a 1024-element vadd in hardware, you might restrict it to 16 elements in emulation. This lets you test your application with the accelerator while still completing execution in a reasonable time.
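
One way to apply this recommendation is to let the host program choose the buffer size from the run target. The following is a minimal sketch, not part of any AMD API: it assumes the `XCL_EMULATION_MODE` environment variable is set to `hw_emu` for hardware emulation runs and is unset on hardware. The names `vadd_length` and `vadd_reference` and the two size constants are hypothetical, chosen to match the example above.

```python
import os

# Sizes taken from the example in the text: 1024 elements on hardware,
# 16 elements in hardware emulation. Both constants are illustrative.
HW_ELEMENTS = 1024
EMU_ELEMENTS = 16


def vadd_length(env=None):
    """Return the vadd element count to use for the current run target."""
    env = os.environ if env is None else env
    # Assumption: XCL_EMULATION_MODE is "hw_emu" for hardware emulation runs
    # and is not set when the application runs on real hardware.
    mode = env.get("XCL_EMULATION_MODE", "")
    return EMU_ELEMENTS if mode == "hw_emu" else HW_ELEMENTS


def vadd_reference(a, b):
    """Software reference result used to check the accelerator output."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    n = vadd_length()
    a = list(range(n))
    b = [2 * x for x in a]
    expected = vadd_reference(a, b)
    print(f"vadd will run with {n} elements; {len(expected)} results to check")
```

Keeping the size decision in one function means the rest of the host code (buffer allocation, kernel arguments, result checking) can stay identical between emulation and hardware runs.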