Baseline and Analyze the Design Before Optimization - 2025.1 English - XD261

Vitis Tutorials: Vitis HLS (XD261)

Document ID: XD261
Release Date: 2025-06-17
Version: 2025.1 English

Before we start any optimization, it is helpful to determine the baseline performance of the design. We can confirm the functionality with a functional simulation and analyze the initial performance so we know how much optimization is required. Many users also find the unoptimized IP block useful: by exporting an IP block early, other engineers, such as a higher-level system integrator, can work in parallel with the IP developer, reducing overall time to market.

To begin, we’ll create a new Vitis workspace and create an HLS component with the provided beamformer code.

  1. Open the Vitis Unified IDE and specify a new or existing workspace.

  2. Create a new HLS Component by clicking Create Component under HLS Development in the Welcome Screen.

  3. In the Source Files step, add the file ./reference_files/beamformer.cpp as a Design File, add the files ./reference_files/beamformer_tb.cpp and ./reference_files/result.golden_float.dat as Test Bench Files, set the Top Function to beamformer, and press Next.

  4. In the Hardware step, set the part to the Versal Premium series device, vp1202-vsva2785-1LP-i-L, press Next, set the clock target to 3 ns, then finish the wizard.

  5. Run and verify the results of C Simulation by pressing Run under C SIMULATION in the FLOW panel. The output should resemble the following:

      beamso_i   beamso_q
      -225.000000 1865.000000 
      -300.000000 1970.000000 
      -375.000000 2105.000000 
      beamso_i   beamso_q
      -150.000000 1790.000000 
      -225.000000 1865.000000 
      -300.000000 1970.000000 
      beamso_i   beamso_q
      -75.000000 1745.000000 
      -150.000000 1790.000000 
      -225.000000 1865.000000 
     Test passed !
     INFO: [SIM 211-1] CSim done with 0 errors.
     INFO: [SIM 211-3] *************** CSIM finish ***************
     INFO: [HLS 200-111] Finished Command csim_design CPU user time: 1 seconds. CPU system time: 1 seconds. Elapsed time: 20.18 seconds; current allocated memory: 1.480 MB.
     INFO: [HLS 200-1510] Running: close_project 
     INFO: [HLS 200-112] Total CPU user time: 3 seconds. Total CPU system time: 4 seconds. Total elapsed time: 24.896 seconds; peak allocated memory: 187.500 MB.
     INFO: [Common 17-206] Exiting vitis_hls at Wed Dec 13 15:14:09 2023...
     INFO: [vitis-run 60-791] Total elapsed time: 0h 0m 32s
     C-simulation finished successfully
    

    By default, HLS applies many optimizations automatically to strike a good balance between performance and resource utilization. To measure a true baseline, we want to turn these optimizations off before we run C Synthesis. This can be done by editing the pragmas directly in the code or by using the HLS Directive panel on the right-hand side.

  6. Open ./reference_files/beamformer.cpp and add #pragma HLS PIPELINE off to loops L1:, L2:, and L3:, as shown here:

       L1:for (i=0; i<SAMPLES; i++) {
    #pragma HLS LOOP_FLATTEN off
    #pragma HLS PIPELINE off      
          L2: for (j=0; j<BEAMS; j++) {
    #pragma HLS LOOP_FLATTEN off
    #pragma HLS PIPELINE off
             si=0;
             sq=0;
    
             L3: for (k=0; k<CHANNELS; k++) {
    #pragma HLS LOOP_FLATTEN off
    #pragma HLS PIPELINE off
    

    The loop flatten pragmas are already included in the unoptimized code provided, so you don’t have to add them manually.

  7. Run C Synthesis and open the Synthesis Report. Expand the Performance and Resource Estimates section to reveal all three nested loops:

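| Modules & Loops | Latency (cycles) | Latency (ns) | Iteration Latency | Interval | Trip Count |
|-----------------|------------------|--------------|-------------------|----------|------------|
| beamformer      | 3147501          | 9.443E6      | -                 | 3147502  | -          |
| L1              | 3147500          | 9.443E6      | 1259              | -        | 2500       |
| L2              | 1257             | 3.771E3      | 419               | -        | 3          |
| L3              | 416              | 1.248E3      | 26                | -        | 16         |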

The overall performance of this unoptimized design is best measured by its interval, which is 3,147,502 cycles, or about 9.44 ms at the 3 ns target clock. The interval of the top-level hardware function is the number of cycles that must elapse before the function can be started again.

The report also shows that the current implementation is completely sequential, which makes it a good candidate for acceleration if we can properly introduce parallelism into the design. We can tell the design is sequential because the loop latencies are additive. Starting from the innermost loop, L3 has 16 iterations and each iteration has a latency of 26 clock cycles; because the iterations run one after another, the latency of the whole loop is 16 × 26 = 416 clock cycles. That 416-cycle loop, plus 3 extra cycles at the beginning and end of each iteration, gives the 419-cycle iteration latency of L2. The pattern continues outward: L2 takes 3 × 419 = 1,257 cycles, each L1 iteration takes 1,259 cycles, and L1 as a whole takes 2,500 × 1,259 = 3,147,500 cycles, which accounts for essentially all of the module's excessively large latency.