Before starting any optimization, it is helpful to determine the baseline performance of the design. We can confirm the functionality with a functional simulation and analyze the initial performance so we know how much optimization is required. Many users also find the unoptimized IP block useful: by exporting an IP block early, other engineers, such as a higher-level system integrator, can work in parallel with the IP developer, reducing overall time to market.
To begin, we’ll create a new Vitis workspace and create an HLS component with the provided beamformer code.
1. Open the Vitis Unified IDE and specify a new or existing workspace.
2. Create a new HLS Component by clicking **Create Component** under **HLS Development** in the Welcome Screen.
3. In the **Source Files** step, add the file `./reference_files/beamformer.cpp` as a Design File, add the files `./reference_files/beamformer_tb.cpp` and `./reference_files/result.golden_float.dat` as Test Bench Files, set the **Top Function** to `beamformer`, and press Next.
4. In the Hardware step, set the part to the Versal Premium series device, `vp1202-vsva2785-1LP-i-L`, press Next, set the clock target to `3ns`, then finish the wizard.
5. Run and verify the results of C Simulation by pressing Run under C SIMULATION in the FLOW panel. The output should resemble the following:
beamso_i beamso_q
-225.000000 1865.000000
-300.000000 1970.000000
-375.000000 2105.000000
beamso_i beamso_q
-150.000000 1790.000000
-225.000000 1865.000000
-300.000000 1970.000000
beamso_i beamso_q
-75.000000 1745.000000
-150.000000 1790.000000
-225.000000 1865.000000
Test passed !
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
INFO: [HLS 200-111] Finished Command csim_design CPU user time: 1 seconds. CPU system time: 1 seconds. Elapsed time: 20.18 seconds; current allocated memory: 1.480 MB.
INFO: [HLS 200-1510] Running: close_project
INFO: [HLS 200-112] Total CPU user time: 3 seconds. Total CPU system time: 4 seconds. Total elapsed time: 24.896 seconds; peak allocated memory: 187.500 MB.
INFO: [Common 17-206] Exiting vitis_hls at Wed Dec 13 15:14:09 2023...
INFO: [vitis-run 60-791] Total elapsed time: 0h 0m 32s
C-simulation finished successfully
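The pass/fail decision behind a log like this can be sketched as follows. This is a minimal illustration, not the contents of `beamformer_tb.cpp`: the helper names `count_mismatches` and `testbench_exit_code` and the tolerance-based comparison are assumptions. What it relies on is that C simulation treats a return value of 0 from the test bench's `main()` as a pass and any non-zero value as a failure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts samples that differ from the golden reference
// by more than `tolerance`. A length mismatch counts as one error per
// missing or extra sample.
std::size_t count_mismatches(const std::vector<float>& actual,
                             const std::vector<float>& golden,
                             float tolerance) {
    assert(tolerance >= 0.0f);  // a negative tolerance would reject everything
    const std::size_t common =
        actual.size() < golden.size() ? actual.size() : golden.size();
    const std::size_t longest =
        actual.size() < golden.size() ? golden.size() : actual.size();
    std::size_t errors = longest - common;
    for (std::size_t n = 0; n < common; ++n) {
        if (std::fabs(actual[n] - golden[n]) > tolerance) {
            ++errors;
        }
    }
    return errors;
}

// Hypothetical helper: maps the error count to the value a test bench's
// main() would return (0 = pass, non-zero = fail).
int testbench_exit_code(std::size_t errors) {
    return errors == 0 ? 0 : 1;
}
```

A test bench built this way would load the golden file into one vector, collect the outputs of the function under test into another, and return `testbench_exit_code(count_mismatches(...))` from `main()`.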
By default, HLS applies many optimizations automatically to strike a good balance between performance and resource utilization. So, before we run C Synthesis, we want to turn these optimizations off so that the results reflect a true unoptimized baseline. This can be done by editing the code directly or by using the HLS Directive panel on the right-hand side.
Open `./reference_files/beamformer.cpp` and add `#pragma HLS PIPELINE off` to loops `L1:`, `L2:`, and `L3:`, as shown here:
L1:for (i=0; i<SAMPLES; i++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
L2: for (j=0; j<BEAMS; j++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
si=0;
sq=0;
L3: for (k=0; k<CHANNELS; k++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
The loop flatten pragmas are already included in the unoptimized code provided, so you don’t have to add them manually.
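For readers who want to see the pragmas in a complete function, here is a self-contained sketch of a fully sequential loop nest with the same labels and pragma placement. It is not the tutorial's `beamformer.cpp`: the function name `beamform_sequential`, the argument layout, the loop bounds, and the complex multiply-accumulate body are all assumptions chosen for illustration. A standard C++ compiler ignores the `#pragma HLS` lines, so the sketch also runs in plain software.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Assumed loop bounds for illustration only.
constexpr int SAMPLES = 2500;
constexpr int BEAMS = 3;
constexpr int CHANNELS = 16;

// Hypothetical beamformer-style kernel: for every sample and beam,
// accumulate the complex product of each channel input with its weight.
void beamform_sequential(const float rx_i[SAMPLES][CHANNELS],
                         const float rx_q[SAMPLES][CHANNELS],
                         const float w_i[BEAMS][CHANNELS],
                         const float w_q[BEAMS][CHANNELS],
                         float out_i[SAMPLES][BEAMS],
                         float out_q[SAMPLES][BEAMS]) {
    int i, j, k;
    float si, sq;
L1: for (i = 0; i < SAMPLES; i++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
    L2: for (j = 0; j < BEAMS; j++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
            si = 0;
            sq = 0;
        L3: for (k = 0; k < CHANNELS; k++) {
#pragma HLS LOOP_FLATTEN off
#pragma HLS PIPELINE off
                // (rx_i + j*rx_q) * (w_i + j*w_q), real and imaginary parts
                si += rx_i[i][k] * w_i[j][k] - rx_q[i][k] * w_q[j][k];
                sq += rx_i[i][k] * w_q[j][k] + rx_q[i][k] * w_i[j][k];
            }
            out_i[i][j] = si;
            out_q[i][j] = sq;
        }
    }
}

}  // namespace sketch
```

With `PIPELINE off` and `LOOP_FLATTEN off` on every level, each loop keeps its own boundary in the synthesis report and no iteration overlaps with the next, which is what makes the baseline easy to reason about.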
Run C Synthesis and open the Synthesis Report. Expand the Performance and Resource Estimates section to reveal all three nested loops:
Modules & Loops | LATENCY(CYCLES) | LATENCY(NS) | ITERATION LATENCY | INTERVAL | TRIP COUNT |
---|---|---|---|---|---|
beamformer | 3147501 | 9.443E6 | - | 3147502 | - |
L1 | 3147500 | 9.443E6 | 1259 | - | 2500 |
L2 | 1257 | 3.771E3 | 419 | - | 3 |
L3 | 416 | 1.248E3 | 26 | - | 16 |
The overall performance of this unoptimized design is best measured by its interval, which is 3,147,502 cycles. The interval of the top-level hardware function is the number of clock cycles that must pass before the function can be started again. In addition, we can see that the current implementation is completely sequential and therefore a good candidate for acceleration if we can properly introduce parallelism into the design. We can tell the design is sequential because the latency of each iteration of each loop is additive. Starting from loop `L3`, we can see that there are 16 iterations and each iteration has a latency of 26 clock cycles; because the iterations run sequentially, the latency of the overall loop is 16 × 26 = 416 clock cycles. This becomes the latency of each iteration of loop `L2`, plus a few extra cycles at the beginning and end of each iteration (419 cycles in total), so `L2` takes 3 × 419 = 1,257 cycles. The pattern continues in `L1`, where 2,500 iterations of 1,259 cycles each add up to 3,147,500 cycles, or roughly 9.4 ms at the 3 ns clock target.
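The arithmetic can be checked with a small compile-time model. This is only a sketch of how the report's numbers relate to each other, assuming the simple rule that a non-pipelined loop's latency is its trip count times its iteration latency; the helper names `loop_latency` and `cycles_to_ns` are made up for this example, and the constants are copied from the synthesis report table.

```cpp
#include <cassert>
#include <cstdint>

// Latency of a non-pipelined loop: iterations run strictly one after another.
constexpr std::uint64_t loop_latency(std::uint64_t trip_count,
                                     std::uint64_t iteration_latency) {
    return trip_count * iteration_latency;
}

// Convert a cycle count to nanoseconds for a given clock period.
constexpr std::uint64_t cycles_to_ns(std::uint64_t cycles,
                                     std::uint64_t clock_period_ns) {
    return cycles * clock_period_ns;
}

// Values copied from the synthesis report.
constexpr std::uint64_t kL3Latency = loop_latency(16, 26);      // L3
constexpr std::uint64_t kL2Latency = loop_latency(3, 419);      // L2
constexpr std::uint64_t kL1Latency = loop_latency(2500, 1259);  // L1

static_assert(kL3Latency == 416, "L3: 16 iterations of 26 cycles");
static_assert(kL2Latency == 1257, "L2: 3 iterations of 419 cycles");
static_assert(kL1Latency == 3147500, "L1: 2500 iterations of 1259 cycles");

// Per-iteration overhead an outer loop adds around its inner loop.
static_assert(419 - kL3Latency == 3, "L2 adds 3 cycles around each L3 run");
static_assert(1259 - kL2Latency == 2, "L1 adds 2 cycles around each L2 run");

// Top-level latency of 3,147,501 cycles at the 3 ns clock target.
static_assert(cycles_to_ns(kL1Latency + 1, 3) == 9442503,
              "about 9.443E6 ns, matching the report");
```

Because every term is a product of a trip count and an iteration latency, shortening the innermost iteration or overlapping iterations has a multiplied effect on the total, which is why the optimization steps that follow focus on the loops.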