Some of the optimizations that Vitis HLS can apply are prevented when a loop has variable
bounds. In the following code example, variable_bound_loops on GitHub, the loop bound is
determined by the variable width, which is driven from a top-level input. In this case, the
loop is considered to have a variable bound, because Vitis HLS cannot determine at compile
time how many iterations the loop executes.
#include "ap_int.h"
#define N 32
typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;
dout_t code028(din_t A[N], dsel_t width) {
    dout_t out_accum = 0;
    dsel_t x;
    LOOP_X: for (x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}
Attempting to optimize the design in the example above reveals the issues created by
variable loop bounds. The first issue is that variable loop bounds prevent Vitis HLS from
determining the latency of the loop. Vitis HLS can determine the latency to complete one
iteration of the loop, but because it cannot statically determine the value of the variable
width, it does not know how many iterations are performed and thus cannot report the loop
latency (the number of cycles to completely execute all iterations of the loop).
When variable loop bounds are present, Vitis HLS reports the latency as a question mark (?)
instead of using exact values. The following shows the result after the synthesis of the
previous example:
+ Summary of overall latency (clock cycles):
    * Best-case latency: ?
    * Worst-case latency: ?
+ Summary of loop latency (clock cycles):
    + LOOP_X:
        * Trip count: ?
        * Latency: ?
The way to overcome this issue is to use the LOOP_TRIPCOUNT pragma or directive to specify
a minimum and/or maximum iteration count for the loop. The tripcount is the number of loop
iterations. If a maximum tripcount of 32 is applied to LOOP_X in the first example, the
report is updated to the following:
+ Summary of overall latency (clock cycles):
    * Best-case latency: 2
    * Worst-case latency: 34
+ Summary of loop latency (clock cycles):
    + LOOP_X:
        * Trip count: 0 ~ 32
        * Latency: 0 ~ 32
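As a hedged sketch of how the directive might look in source form, the pragma below is placed inside the body of LOOP_X. The function name loop_var_tripcount is hypothetical, and plain C++ types stand in for the ap_int types so that the snippet compiles without the Vitis HLS headers; an ordinary C++ compiler ignores the pragma, possibly with an unknown-pragma warning.

```cpp
#define N 32

// Hypothetical variant of the first example. Plain C++ types replace the
// ap_int/ap_uint typedefs so the code builds without the Vitis HLS headers.
int loop_var_tripcount(const signed char A[N], unsigned width) {
    int out_accum = 0;
LOOP_X:
    for (unsigned x = 0; x < width; x++) {
// Tells the tool to assume between 0 and 32 iterations when reporting latency.
#pragma HLS loop_tripcount min=0 max=32
        out_accum += A[x];
    }
    return out_accum;
}
```

The pragma does not change what the function computes: the loop still runs width times in C simulation.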
The user-provided values for the LOOP_TRIPCOUNT directive are used only for reporting, or to support the PERFORMANCE pragma or directive. The specified tripcount lets Vitis HLS report latency values, so that results from different solutions can be compared. To have the same loop-bound information used for synthesis, the C/C++ code must be updated with asserts, which do impact synthesis. Use them carefully, because the assert condition is assumed to be true.
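A minimal sketch of the assert approach, under the assumption that the caller never passes a width greater than N. The function name loop_var_assert is hypothetical, and plain C++ types stand in for the ap_int types so that the snippet compiles without the Vitis HLS headers. How much the tool actually benefits from the assertion depends on the tool version, so treat this as an illustration rather than a guaranteed optimization.

```cpp
#include <cassert>

#define N 32

// Hypothetical variant of the first example, with the loop bound stated
// as an assertion in the code.
int loop_var_assert(const signed char A[N], unsigned width) {
    // Assumption: callers never pass width > N. In C simulation a
    // violation aborts the run (unless NDEBUG is defined).
    assert(width <= N);
    int out_accum = 0;
LOOP_X:
    for (unsigned x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}
```

Because the condition is assumed to hold, a call that violates it could behave differently in hardware than in simulation, which is why the bound should match what the surrounding system guarantees.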
The next steps in optimizing the first example for a lower initiation interval are:
- Unroll the loop and allow the accumulations to occur in parallel.
- Partition the array input; otherwise, the parallel accumulations are limited by a single memory port.
If these code transformations are applied, the output from Vitis HLS highlights the most significant issue with variable bound loops:
WARNING: [HLS 200-936] Cannot unroll loop 'LOOP_X' (loop_var.cpp:22) in
function 'loop_var': cannot completely unroll a loop with a variable trip count.
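Because loops with variable bounds cannot be fully unrolled, they not only prevent the unroll directive from being applied, they also prevent pipelining the levels above the loop, because pipelining a function or outer loop requires every loop beneath it in the hierarchy to be fully unrolled.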
The solution to loops with variable bounds is to make the number of loop iterations a
fixed value, with conditional execution inside the loop. The code from the variable loop
bounds example can be rewritten as shown in the following code example. Here, the loop
bound is explicitly set to the maximum value of the variable width, and the loop body is
conditionally executed:
#include "ap_int.h"
#define N 32
typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;
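dout_t loop_max_bounds(din_t A[N], dsel_t width) {
    dout_t out_accum = 0;
    // x is an int: a 5-bit dsel_t counter wraps at 31, so it could never
    // reach N (32) and the loop would not terminate.
    int x;
    LOOP_X: for (x = 0; x < N; x++) {
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}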
The for-loop (LOOP_X) in the example above can be fully unrolled. Because the loop has a
fixed upper bound, Vitis HLS knows how much hardware to create. There are N (32) copies of
the loop body in the RTL design. Each copy of the loop body has conditional logic
associated with it and is executed depending on the value of the variable width. Refer to
Vitis-HLS-Introductory-Examples/Modeling/variable_bound_loops for an example.
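The sketch below shows one way the unroll and array partition optimizations might be written as pragmas on the fixed-bound loop. The function names loop_max_bounds_unrolled and loop_var_reference are hypothetical, plain C++ types stand in for the ap_int types so that the snippet compiles without the Vitis HLS headers, and the exact pragma option spelling can differ between tool versions. The reference function keeps the original variable bound so that the two versions can be compared in C simulation.

```cpp
#define N 32

// Hypothetical sketch: fixed-bound loop with both optimizations as pragmas.
int loop_max_bounds_unrolled(const signed char A[N], unsigned width) {
// Split A into individual elements so all of them can be read in parallel.
#pragma HLS array_partition variable=A complete dim=1
    int out_accum = 0;
LOOP_X:
    for (unsigned x = 0; x < N; x++) {
// The constant bound N allows the tool to create N copies of the loop body.
#pragma HLS unroll
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}

// Reference model with the original variable loop bound.
int loop_var_reference(const signed char A[N], unsigned width) {
    int out_accum = 0;
    for (unsigned x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}
```

For every width from 0 to N the two functions return the same sum, so the rewrite changes the loop structure without changing the result.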