Design Migration - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID: XD261
Release Date: 2025-06-17
Version: 2025.1 English

In this final section, we will highlight one of the key advantages of developing in Vitis HLS: the ability to quickly migrate a design from one AMD part to another. In our case, we will change the target device from the Versal Premium (vp1202) device we have been using to a Zynq UltraScale+ RFSoC (ZU28DR). Then we will compare the resulting performance and resources.

In this tutorial, we’ll be comparing the highest performance version of the beamformer algorithm, but you could just as easily make the comparison on a different version. One last time, feel free to clone the component you want to compare against if you want to preserve the state of the existing component. This is again optional, but it makes a side-by-side comparison easy, as shown in the following image:

Figure: Synthesis Results View showing Versal Premium versus Zynq UltraScale+ RFSoC

To target the new device and generate the results shown:

  1. Select the desired HLS component in the FLOW panel.

  2. Click the gear to the right of the component selection drop-down.

  3. Select hls_config.cfg

  4. Under ‘General’, go to ‘part’, and either Browse to or type in xczu28dr-ffve1156-1L-i

  5. Run C Synthesis and view the Synthesis Report
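
For reference, the GUI steps above amount to a one-line change in the component's configuration file. The fragment below is a minimal sketch, assuming the tool stores the setting as a top-level `part` entry in hls_config.cfg; every other entry in the file is left untouched.

```
part=xczu28dr-ffve1156-1L-i
```

Editing the file by hand and re-running C Synthesis should be equivalent to steps 1 through 5, under that assumption.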

Here is a summary of the pertinent information from the two reports:

Target device             Interval (cycles)   DSP   FF       LUT
Versal Premium            2625                288   47456    11454
Zynq UltraScale+ RFSoC    2782                576   177434   111181

With no change to the HLS C code or compiler directives, we can see a significant increase in the resource utilization of this design and a minor decrease in performance when targeting the previous-generation part.

The performance decrease comes from the increased latency of the complex multiply. In both cases, the II of loop L1 remains equal to 1. However, the iteration latency increases due to the less efficient DSP48 architecture. The latency of the loop is equal to the loop Trip Count multiplied by II, plus the iteration latency incurred waiting for the last iteration of the loop to finish. The longer iteration latency therefore causes the overall decrease in performance of the application. Note: the PIPELINE pragma can start the next loop transaction without waiting for the pipeline to flush; this is possible with the rewind option. The rewind option will have no effect on latency but will decrease II, thus increasing the throughput.
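
The loop latency relationship described above can be sketched in a few lines of C++. This is only an illustrative model, not code from the beamformer design: the function name `loop_latency` and the sample constants are hypothetical, and the numbers are not taken from the synthesis reports.

```cpp
#include <cassert>
#include <cstdint>

// Pipelined-loop latency as described in the text:
//   latency = trip_count * II + iteration_latency
// ii and iteration_latency are in clock cycles; trip_count is a plain count.
inline std::uint64_t loop_latency(std::uint64_t trip_count,
                                  std::uint64_t ii,
                                  std::uint64_t iteration_latency) {
    assert(ii >= 1);  // a pipelined loop cannot start iterations faster than II = 1
    return trip_count * ii + iteration_latency;
}

// Hypothetical values, chosen only to illustrate the effect.
// They are NOT the trip count or iteration latencies of loop L1.
const std::uint64_t kTripCount = 1000;
const std::uint64_t kShortIterationLatency = 20;  // stand-in for a shorter multiply pipeline
const std::uint64_t kLongIterationLatency = 60;   // stand-in for a longer multiply pipeline
```

With these made-up values and II = 1, `loop_latency(kTripCount, 1, kShortIterationLatency)` is 1020 cycles and `loop_latency(kTripCount, 1, kLongIterationLatency)` is 1060 cycles. The difference equals the difference in iteration latency, which is why a longer multiply pipeline costs only a small fraction of the total when the trip count is large.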

Targeting the UltraScale+ part had a larger impact on utilization. As you can see, the resources required by the ZU28DR are on average about 4X those required by Versal: DSP utilization increased by about 2X, FF utilization increased by about 4X, and LUT utilization increased by almost 10X. This is due to the more efficient DSP58 primitive on the Versal Premium device compared to the DSP48 primitive on the Zynq UltraScale+ device. In its native floating-point mode, the DSP58 can compute a floating-point multiply-accumulate with just one DSP primitive. The DSP48 in ZU+ has only a fixed-point mode and must, therefore, use significantly more resources to implement a complex multiply on floating-point data. On average, DSP58 devices are 4 times more compute efficient than DSP48 devices.
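
The ratios quoted above can be reproduced directly from the summary table. The following is a small sketch for checking the arithmetic; the struct and function names are hypothetical, and the only data used are the four figures per device listed in the table.

```cpp
#include <cassert>

// Interval and resource figures copied from the summary table.
struct SynthSummary {
    double interval_cycles;
    double dsp;
    double ff;
    double lut;
};

const SynthSummary kVersalPremium = {2625, 288, 47456, 11454};
const SynthSummary kZynqRfsoc = {2782, 576, 177434, 111181};

// Ratio of a metric on the new target relative to the baseline target.
inline double ratio(double new_value, double baseline) {
    assert(baseline > 0.0);  // guard against dividing by an empty entry
    return new_value / baseline;
}
```

For example, `ratio(kZynqRfsoc.dsp, kVersalPremium.dsp)` gives exactly 2, the FF ratio comes out a little under 4, and the LUT ratio a little under 10, matching the figures in the text. The same function applied to the interval shows an increase of roughly 6 percent, which is the minor performance decrease noted earlier.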