Working with Nested Loops - 2025.2 English - UG1399

Vitis High-Level Synthesis User Guide (UG1399)

Document ID
UG1399
Release Date
2025-11-20
Version
2025.2 English

A loop can be pipelined only if it does not contain any loop (possibly after some inner loop unrolling). A loop containing a pipelined loop is therefore either dataflow or sequential. When it is sequential, its iterations are executed in sequence and, for each iteration, the inner loop is fully executed once. In the RTL implementation, this requires one clock cycle to move from the outer loop to the inner loop, one clock cycle to move back from the inner loop to the outer loop, plus, in between, the whole latency to complete all iterations of the inner loop. To get better performance (lower latency) when working with nested loops, it becomes crucial to create perfectly nested loops and to convert them into a single loop that can be pipelined.

Perfect_nested_loop_1: for (int i = 0; i < N; ++i) {
    Perfect_nested_loop_2: for (int j = 0; j < M; ++j) {
        ...
    }
}
 
Imperfect_nested_loop_0: for (int i = 0; i < N; ++i) {
    Imperfect_nested_loop_2: for (int j = 0; j < M; ++j) {
        ...
    }
    Imperfect_nested_loop_3: for (int j = 0; j < M; ++j) {
        ...
    }
}
Perfect loop nest
Only the innermost loop has loop body content, there is no logic specified between the loop statements and all the loop bounds are constant
Semi-perfect loop nest
Only the innermost loop has loop body content, there is no logic specified between the loop statements but the outermost loop bound can be a variable.
Imperfect loop nest
The inner loop has variable bounds or the loop body is not exclusively inside the inner loop. In this case designers should try to restructure the code or unroll the loops in the loop body to create a perfect loop nest.

Perfect loop nests are nested loops where each outer loop contains only one subloop and nothing else. Almost-perfect loop nests (nested loops where each outer loop can contain, in addition, elementary instructions) can be automatically transformed into perfect loop nest by pushing these instructions into the innermost loop.

The LOOP_FLATTEN pragma or directive is used to allow perfect and almost-perfect nested loops to be flattened (with some additional important flatten ability requirements about their tripcount, see the description of the loop_flatten pragma or directive), removing the need to re-code for optimal hardware performance and reducing the number of cycles it takes to perform the operations in the loop. When the LOOP_FLATTEN optimization is applied to a set of nested loops, it should be applied to the innermost loop that contains the loop body.

When pipelining nested loops, the optimal balance between area and performance is typically found by pipelining the innermost loop. This also results in the fastest runtime. The following code example, pipelined_loop available on GitHub, demonstrates the trade-offs when pipelining loops and functions.

#include "loop_pipeline.h"
 
dout_t loop_pipeline(din_t A[N]) { 
    int i,j;
    static dout_t acc;
   
    LOOP_I:for(i=0; i < 20; i++){
        LOOP_J: for(j=0; j < 20; j++){
            acc += A[j] * i;
        }
    }
    return acc;
}

In the above example, if the innermost (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a single multiplier). Vitis HLS automatically flattens the loops when possible, as in this case, and effectively creates a new single loop (now called LOOP_I_LOOP_J) with 20*20 iterations. Only one multiplier operation and one array access need to be scheduled, then the loop iterations can be scheduled as a single loop-body entity (20x20 loop iterations). Thanks to flattening, the latency of the body is paid once, instead of 20 times.

If the outer loop (LOOP_I) is pipelined, inner-loop (LOOP_J) is unrolled creating 20 copies of the loop body: 20 multipliers and 1 array accesses must now be scheduled. Then each iteration of LOOP_I can be scheduled as a single entity.

If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 20 array accesses must now be scheduled. It is very unlikely that Vitis HLS produces a design with 400 multiplications because, in most designs, data dependencies often prevent maximal parallelism, for example, even if a dual-port RAM is used for A, the design can only access two values of A in any clock cycle. Otherwise, the array must be partitioned into 400 registers, which then can all be read in one clock cycle, with a very significant HW cost.

The concept to appreciate when selecting at which level of the hierarchy to pipeline is to understand that pipelining the innermost loop gives the smallest hardware with generally acceptable throughput for most applications. Pipelining the upper levels of the hierarchy unrolls all sub-loops and can create many more operations to schedule (which could impact compile time and memory capacity), but typically gives the highest performance design in terms of throughput and latency. The data access bandwidth must be matched to the requirements of the operations that are expected to be executed in parallel. This implies that you might need to partition array A to exploit the potential parallelism exposed by loop unrolling.

To summarize the above options:

  • Pipeline LOOP_J flattened with LOOP_I: Latency is approximately 400 cycles (20x20) and requires less than 250 LUTs and registers (the I/O control and FSM are always present).

    Figure 1. Performance & Resource Estimates
  • Pipeline LOOP_I: Latency is 13 cycles but requires a few hundred LUTs and registers. About twice the logic as the first option, minus any logic optimizations that can be made.

    Figure 2. Performance & Resource Estimates
  • Pipeline function loop_pipeline: Latency is now only 3 cycles (due to 20 parallel register accesses) but requires almost twice the logic as the second option (and about 4 times the logic of the first option), minus any optimizations that can be made.

    Figure 3. Performance & Resource Estimates

Vitis HLS cannot flatten imperfect loop nests (loops containing more than one loop or control flow). This results in additional clock cycles to enter and exit the loops, and the latency of the inner loops is paid for each iteration of the outer loop. When the design contains nested loops, analyze the results to ensure that as many nested loops as possible have been flattened: review the log file or look in the synthesis report for cases (as shown above) where the loop labels have been merged (LOOP_I and LOOP_J are now reported as LOOP_I_LOOP_J)