Working with Variable Loop Bounds - 2022.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2022-12-07
Version: 2022.2 English

Some of the optimizations that Vitis HLS can apply are prevented when a loop has variable bounds. In the following code example, variable_bound_loops on GitHub, the loop bound is determined by the variable width, which is driven from a top-level input. The loop is therefore considered to have a variable bound, because Vitis HLS cannot determine at compile time how many iterations it executes.

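#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;    // 8-bit signed input samples
typedef ap_int<13> dout_t;  // 13-bit signed accumulator
typedef ap_uint<5> dsel_t;  // 5-bit unsigned loop bound (0 to 31)

dout_t loop_var(din_t A[N], dsel_t width) {

  dout_t out_accum = 0;
  dsel_t x;

  // The trip count depends on the run-time value of width.
  LOOP_X: for (x = 0; x < width; x++) {
    out_accum += A[x];
  }

  return out_accum;
}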

Attempting to optimize the design in the example above reveals the issues created by variable loop bounds. The first issue is that they prevent Vitis HLS from determining the latency of the loop. Vitis HLS can determine the latency of one iteration, but because it cannot statically determine the value of the variable width, it does not know how many iterations are performed and thus cannot report the loop latency (the number of cycles needed to execute all iterations of the loop).

When variable loop bounds are present, Vitis HLS reports the latency as a question mark (?) instead of using exact values. The following shows the result after the synthesis of the previous example:

+ Summary of overall latency (clock cycles):
  * Best-case latency:    ?
  * Worst-case latency:   ?
+ Summary of loop latency (clock cycles):
  + LOOP_X:
    * Trip count: ?
    * Latency:    ?

The way to overcome this issue is to use the LOOP_TRIPCOUNT pragma or directive to specify a minimum and/or maximum iteration count for the loop. The tripcount is the number of loop iterations. If a maximum tripcount of 32 is applied to LOOP_X in the first example, the report is updated to the following:

+ Summary of overall latency (clock cycles):
  * Best-case latency:    2
  * Worst-case latency:   34
+ Summary of loop latency (clock cycles):
  + LOOP_X:
    * Trip count: 0 ~ 32
    * Latency:    0 ~ 32
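
As an illustration, the tripcount hint might be placed inside the loop body as sketched below. This is a minimal sketch under stated assumptions: the function name loop_var_tripcount is hypothetical, and plain C++ integer types stand in for the ap_int types of the original example so that the code builds with an ordinary compiler, which simply ignores the HLS pragma.

```cpp
#include <cstdint>

// Assumption: plain integer types replace din_t, dout_t and dsel_t so the
// sketch compiles without the Vitis HLS headers.
constexpr int N = 32;

// Hypothetical name; same accumulation as the variable-bound example.
// The caller is expected to keep width in the range 0..N.
int16_t loop_var_tripcount(const int8_t A[N], uint8_t width) {
    int16_t out_accum = 0;

LOOP_X:
    for (uint8_t x = 0; x < width; x++) {
// Reporting hint: the loop executes between 0 and 32 times.
#pragma HLS loop_tripcount min=0 max=32
        out_accum += A[x];
    }

    return out_accum;
}
```

The pragma does not change what the function computes; it only gives the tool a range to use in the latency report.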

The user-provided values for the LOOP_TRIPCOUNT directive are used only for reporting, or to support the PERFORMANCE pragma or directive. The specified tripcount allows Vitis HLS to report latency values, so that values from different solutions can be compared. To have the same loop-bound information used for synthesis, update the C/C++ code with asserts. Asserts do impact synthesis, so use them carefully: the assert condition is assumed to be true.
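
A bound stated through an assert might look like the sketch below. It is an assumption-laden illustration rather than the guide's own code: the function name loop_var_assert is hypothetical, plain integer types replace the ap_int types, and the exact effect of the assert on the generated hardware should be confirmed against the tool documentation for the version in use.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: plain integer types replace din_t, dout_t and dsel_t.
constexpr int N = 32;

// Hypothetical name; the assert states the upper limit of width in the code.
int16_t loop_var_assert(const int8_t A[N], uint8_t width) {
    // Checked during C simulation; the condition is assumed to hold, so
    // every caller must guarantee that width never exceeds N.
    assert(width <= N);

    int16_t out_accum = 0;

LOOP_X:
    for (uint8_t x = 0; x < width; x++) {
        out_accum += A[x];
    }

    return out_accum;
}
```

If a caller ever passes a width larger than N, the assert fails in C simulation, which is the behavior to rely on when checking that the stated bound is actually true.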

The next steps in optimizing the first example for a lower initiation interval are:

  • Unroll the loop and allow the accumulations to occur in parallel.
  • Partition the input array; otherwise, the parallel accumulations are limited by a single memory port.
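
The two steps above might be expressed with pragmas on the variable-bound loop as sketched here. The function name loop_var_unroll_attempt is hypothetical, plain integer types stand in for the ap_int types, and the option spelling of the array_partition pragma (type=complete) is an assumption that should be checked against the pragma reference for the tool version in use.

```cpp
#include <cstdint>

// Assumption: plain integer types replace din_t, dout_t and dsel_t.
constexpr int N = 32;

// Hypothetical name; the variable-bound loop with both directives applied.
int16_t loop_var_unroll_attempt(const int8_t A[N], uint8_t width) {
// Split A into individual elements so that all of them can be read at once.
#pragma HLS array_partition variable=A type=complete dim=1

    int16_t out_accum = 0;

LOOP_X:
    for (uint8_t x = 0; x < width; x++) {
// Request full unrolling; the bound is still the run-time value of width.
#pragma HLS unroll
        out_accum += A[x];
    }

    return out_accum;
}
```

A standard compiler ignores both pragmas, so the function still computes the same sum in software; the directives only matter to the synthesis tool.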

If these code transformations are applied, the output from Vitis HLS highlights the most significant issue with variable bound loops:

WARNING: [HLS 200-936] Cannot unroll loop 'LOOP_X' (loop_var.cpp:22) in 
function 'loop_var': cannot completely unroll a loop with a variable trip count.

Because loops with variable bounds cannot be fully unrolled, they not only prevent the unroll directive from being applied but also prevent pipelining of the levels of hierarchy above the loop.

Important: When a loop or function is pipelined, Vitis HLS unrolls all loops in the hierarchy below the function or loop. If there is a loop with variable bounds in this hierarchy, it prevents pipelining.

The solution to loops with variable bounds is to make the number of loop iterations a fixed value and to execute the loop body conditionally. The code from the variable loop bounds example can be rewritten as shown in the following code example. Here, the loop bound is explicitly set to the maximum number of iterations, N, and the loop body executes only when the loop index is less than the variable width:

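#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;    // 8-bit signed input samples
typedef ap_int<13> dout_t;  // 13-bit signed accumulator
typedef ap_uint<5> dsel_t;  // 5-bit unsigned loop bound (0 to 31)

dout_t loop_max_bounds(din_t A[N], dsel_t width) {

  dout_t out_accum = 0;
  // The counter needs 6 bits to reach N (32). A 5-bit counter wraps
  // from 31 to 0, so x < N would always be true and the loop would not end.
  ap_uint<6> x;

  // Fixed trip count of N; the body runs only for indices below width.
  LOOP_X: for (x = 0; x < N; x++) {
    if (x < width) {
      out_accum += A[x];
    }
  }

  return out_accum;
}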

The for-loop (LOOP_X) in the example above can be fully unrolled. Because the loop has a fixed upper bound, Vitis HLS knows how much hardware to create. There are N (32) copies of the loop body in the RTL design. Each copy of the loop body has conditional logic associated with it and is executed depending on the value of the variable width. Refer to Vitis-HLS-Introductory-Examples/Modeling/variable_bound_loops for an example.
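
One way to gain confidence in such a rewrite is to check in C simulation that the fixed-bound version returns the same result as the variable-bound version for every legal width. The sketch below does this under stated assumptions: all function names are hypothetical, plain integer types replace the ap_int types, and the array_partition option spelling (type=complete) should be checked against the pragma reference for the tool version in use.

```cpp
#include <cstdint>

// Assumption: plain integer types replace din_t, dout_t and dsel_t.
constexpr int N = 32;

// Reference version: the trip count follows the run-time value of width.
int16_t sum_variable_bound(const int8_t A[N], uint8_t width) {
    int16_t out_accum = 0;
    for (uint8_t x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}

// Rewritten version: fixed trip count of N with a conditional body,
// annotated with the directives that the fixed bound makes applicable.
int16_t sum_fixed_bound(const int8_t A[N], uint8_t width) {
#pragma HLS array_partition variable=A type=complete dim=1
    int16_t out_accum = 0;

LOOP_X:
    for (int x = 0; x < N; x++) {
#pragma HLS unroll
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}

// Returns true when both versions agree for every width from 0 to N.
bool versions_agree(const int8_t A[N]) {
    for (int w = 0; w <= N; w++) {
        const uint8_t width = static_cast<uint8_t>(w);
        if (sum_variable_bound(A, width) != sum_fixed_bound(A, width)) {
            return false;
        }
    }
    return true;
}
```

A test bench could call versions_agree with a few representative input arrays before the design is synthesized.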