Unrolling Loops - 2023.2 English

Vitis High-Level Synthesis User Guide (UG1399)

A loop executes for the number of iterations determined by its induction variable and loop bounds. The iteration count can also be affected by logic inside the loop body (for example, break conditions or modifications to a loop exit variable). Unrolling a loop creates multiple copies of the loop body in the RTL design, which allows some or all loop iterations to occur in parallel. Use the UNROLL pragma to unroll loops and increase data access and throughput.

By default, HLS loops are kept rolled, meaning that every iteration of the loop uses the same hardware. Unrolling the loop gives each iteration its own hardware to perform the loop function, so an unrolled loop can perform significantly better than a rolled one. However, the added performance comes at the expense of added area and resource utilization.
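As a rough software-level picture of this difference, the sketch below contrasts a rolled loop with a hand-unrolled equivalent. The names `sum_rolled` and `sum_unrolled` and the array size of 4 are illustrative assumptions and are not part of the guide's example; in practice the HLS tool performs this transformation for you.

```cpp
// Illustrative only: the names and the size of 4 are assumptions for this sketch.
const int SIZE = 4;

// Rolled form: the same adder is reused on every iteration.
int sum_rolled(const int a[SIZE]) {
    int acc = 0;
    for (int i = 0; i < SIZE; i++) {
        acc += a[i];
    }
    return acc;
}

// Hand-unrolled form: every iteration is written out, so each addition
// can map to its own hardware and the loop control disappears.
int sum_unrolled(const int a[SIZE]) {
    return a[0] + a[1] + a[2] + a[3];
}
```

Both functions return the same result; only the amount of hardware that can work in parallel differs.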

Consider the basic_loops_primer example from GitHub, as shown below:

#include "test.h"
dout_t test(din_t A[N]) {
  dout_t out_accum=0;
  dsel_t x;
  LOOP_1:for (x=0; x<N; x++) {
      out_accum += A[x];
  }
  return out_accum;
}

With no optimization, the Synthesis Summary report in the figure below shows that the implementation is sequential. This can be confirmed from the entry for LOOP_1, which reports a trip count of 10 iterations and a latency of 200. Because the loop is sequential and not pipelined, this latency is also the time before the loop can accept new input values.

Figure 1. Performance & Resource Estimates

To get optimal throughput, the latency needs to be as short as possible. Assuming the loop bounds are static, you can increase performance by fully unrolling the loop with the UNROLL pragma, which creates parallel implementations of the loop body. After LOOP_1 is fully unrolled, the figure below shows a significant reduction in latency (50 ns). Unrolling is a trade-off: it achieves higher performance at the cost of extra resources, seen below as an increase in FFs and LUTs. Fully unrolling the loop also causes the loop itself to disappear. It is replaced by the parallel implementations of the loop body, which consume the extra resources shown below.
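A minimal sketch of full unrolling applied to the example function follows. Because test.h is not shown here, the `N`, `din_t`, `dout_t`, and `dsel_t` definitions are assumed stand-ins (plain `int` types and N = 10, matching the reported trip count), and the function name `test_unrolled` is hypothetical.

```cpp
// Assumed stand-ins for the definitions in test.h, which is not shown here.
const int N = 10;
typedef int din_t;
typedef int dout_t;
typedef int dsel_t;

// Hypothetical variant of the example with LOOP_1 fully unrolled.
dout_t test_unrolled(din_t A[N]) {
    dout_t out_accum = 0;
    dsel_t x;
    LOOP_1: for (x = 0; x < N; x++) {
        // With no factor given, the pragma requests a full unroll.
#pragma HLS UNROLL
        out_accum += A[x];
    }
    return out_accum;
}
```

A standard C++ compiler ignores the HLS pragma (possibly with a warning), so the function computes the same sum in C simulation; only the synthesized hardware changes.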

Figure 2. Performance & Resource Estimates

In some cases the loop cannot be completely unrolled, because the resulting increase in resources exceeds what the platform provides. In this situation, a partially unrolled loop can be the preferred solution, offering some performance improvement without requiring as many resources. To partially unroll a loop, specify an unroll factor in the pragma or directive. Unrolling the same loop with a factor of 2 (which duplicates the loop body and halves the trip count to 5) can be an acceptable solution for this constrained case, as shown below.
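Partial unrolling of the same function might be written as in the following sketch. As before, the type and size definitions are assumed stand-ins for test.h, and the function name `test_unroll2` is hypothetical.

```cpp
// Assumed stand-ins for the definitions in test.h, which is not shown here.
const int N = 10;
typedef int din_t;
typedef int dout_t;
typedef int dsel_t;

// Hypothetical variant of the example with LOOP_1 unrolled by a factor of 2.
dout_t test_unroll2(din_t A[N]) {
    dout_t out_accum = 0;
    dsel_t x;
    LOOP_1: for (x = 0; x < N; x++) {
        // factor=2 requests two copies of the loop body, so the loop
        // runs N/2 = 5 iterations in the synthesized design.
#pragma HLS UNROLL factor=2
        out_accum += A[x];
    }
    return out_accum;
}
```

Larger factors trade more resources for fewer iterations, so the factor is typically chosen to fit the resource budget of the target device.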

Figure 3. Performance & Resource Estimates

Additionally, when you partially unroll the loop, the HLS tool will implement an exit check in the loop in case the trip count is not perfectly divisible by the unroll factor. The exit check is skipped if the trip count is perfectly divisible by the unroll factor.
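The exit check can be pictured with a hand-written model of a loop unrolled by a factor of 2. This is only a sketch of the idea, not the code the tool generates; the function name `sum_unroll2_model` and the run-time length parameter `n` are assumptions for illustration.

```cpp
// Hypothetical software model of a loop partially unrolled by a factor of 2.
int sum_unroll2_model(const int *a, int n) {
    int acc = 0;
    for (int i = 0; i < n; i += 2) {
        acc += a[i];              // first copy of the loop body
        if (i + 1 >= n) break;    // exit check: taken only when n is odd
        acc += a[i + 1];          // second copy of the loop body
    }
    return acc;
}
```

When `n` is even the break is never taken, which is why the check is unnecessary for trip counts that are a multiple of the unroll factor. The UNROLL pragma also provides a `skip_exit_check` option for cases where you know this holds.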