Some of the optimizations that Vitis HLS can apply are prevented when a loop has variable bounds. In the following code example, the loop bound is determined by the variable width, which is driven from a top-level input. In this case, the loop is considered to have variable bounds, because Vitis HLS cannot know when the loop will complete.
#include "ap_int.h"
#define N 32
typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;
dout_t code028(din_t A[N], dsel_t width) {
  dout_t out_accum=0;
  dsel_t x;
  LOOP_X:for (x=0;x<width; x++) {
    out_accum += A[x];
  }
  return out_accum;
}
Attempting to optimize the design in the example above reveals the issues created by variable loop bounds. The first issue with variable loop bounds is that they prevent Vitis HLS from determining the latency of the loop. Vitis HLS can determine the latency to complete one iteration of the loop, but because it cannot statically determine the exact value of variable width, it does not know how many iterations are performed and thus cannot report the loop latency (the number of cycles to completely execute every iteration of the loop).
When variable loop bounds are present, Vitis HLS reports the latency as a question mark (?) instead of using exact values. The following shows the result after synthesis of the example above.
+ Summary of overall latency (clock cycles):
* Best-case latency: ?
* Worst-case latency: ?
+ Summary of loop latency (clock cycles):
+ LOOP_X:
* Trip count: ?
* Latency: ?
Another issue with variable loop bounds is that the performance of the design is unknown. The two ways to overcome this issue are as follows:
- Use the pragma HLS loop_tripcount or set_directive_loop_tripcount.
- Use an assert macro in the C/C++ code.
The tripcount directive allows a minimum and/or maximum tripcount to be specified for the loop. The tripcount is the number of loop iterations. If a maximum tripcount of 32 is applied to LOOP_X in the first example, the report is updated to the following:
+ Summary of overall latency (clock cycles):
* Best-case latency: 2
* Worst-case latency: 34
+ Summary of loop latency (clock cycles):
+ LOOP_X:
* Trip count: 0 ~ 32
* Latency: 0 ~ 32
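As an illustration of where such a directive could be placed, the following is a minimal sketch of the first example with a loop_tripcount pragma inside LOOP_X. It makes two assumptions that are not part of the original design: standard fixed-width integer types stand in for the ap_int types so that the sketch compiles without the Vitis HLS headers, and the function name sum_with_tripcount is hypothetical. An ordinary C++ compiler ignores the pragma.

```cpp
#include <cstdint>

const int N = 32;

// Assumption: plain integer stand-ins for ap_int<8>, ap_int<13> and ap_uint<5>.
typedef std::int8_t  din_t;
typedef std::int16_t dout_t;
typedef std::uint8_t dsel_t;

// Hypothetical name; same accumulation as the variable-bound example.
dout_t sum_with_tripcount(const din_t A[N], dsel_t width) {
    dout_t out_accum = 0;
LOOP_X:
    for (dsel_t x = 0; x < width; x++) {
        // Reporting hint only: tells the tool to assume 0 to 32 iterations
        // when it computes latency figures. It does not change the hardware.
#pragma HLS loop_tripcount min=0 max=32
        out_accum += A[x];
    }
    return out_accum;
}
```

The pragma sits inside the body of the loop it describes. The loop itself still has a variable bound, so its behavior is unchanged.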
The user-provided values for the tripcount directive are used only for reporting. The tripcount value lets Vitis HLS report numeric latency values instead of question marks, so that the reports from different solutions can be compared. To have this same loop-bound information used for synthesis, the C/C++ code must be updated.
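One way to put the bound into the code itself is the assert macro mentioned earlier. The following is a minimal sketch, not a definitive implementation. It assumes standard integer types in place of the ap_int types, and the function name sum_with_assert is hypothetical. How far the tool uses the asserted range during synthesis is also an assumption and may depend on the tool version.

```cpp
#include <cassert>
#include <cstdint>

const int N = 32;

// Assumption: plain integer stand-ins for ap_int<8>, ap_int<13> and ap_uint<5>.
typedef std::int8_t  din_t;
typedef std::int16_t dout_t;
typedef std::uint8_t dsel_t;

// Hypothetical name; the loop bound is still the run-time value of width.
dout_t sum_with_assert(const din_t A[N], dsel_t width) {
    // Range assertion placed before the loop: states that width never exceeds N,
    // which gives an upper limit for the trip count of LOOP_X.
    assert(width <= N);
    dout_t out_accum = 0;
LOOP_X:
    for (dsel_t x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}
```

In C simulation the assertion also catches any test vector that violates the stated range.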
The next steps in optimizing the first example for a lower initiation interval are:
- Unroll the loop and allow the accumulations to occur in parallel.
- Partition the input array; otherwise, the parallel accumulations are limited by a single memory port.
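The two steps above might be expressed with pragmas as in the following sketch. It assumes standard integer types in place of the ap_int types, a hypothetical function name (sum_unroll_attempt), and the `complete dim=1` form of the array_partition pragma, whose exact spelling can differ between tool versions. The loop still has a variable bound, so this shows only where the directives would go.

```cpp
#include <cstdint>

const int N = 32;

// Assumption: plain integer stand-ins for ap_int<8>, ap_int<13> and ap_uint<5>.
typedef std::int8_t  din_t;
typedef std::int16_t dout_t;
typedef std::uint8_t dsel_t;

// Hypothetical name; variable-bound loop with the two directives requested.
dout_t sum_unroll_attempt(const din_t A[N], dsel_t width) {
    // Request one port per array element so parallel reads are possible.
#pragma HLS array_partition variable=A complete dim=1
    dout_t out_accum = 0;
LOOP_X:
    for (dsel_t x = 0; x < width; x++) {
        // Request full unrolling of a loop whose trip count is unknown.
#pragma HLS unroll
        out_accum += A[x];
    }
    return out_accum;
}
```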
If these optimizations are applied, the output from Vitis HLS highlights the most significant issue with variable bound loops:
@W [XFORM-503] Cannot unroll loop 'LOOP_X' in function 'code028': cannot completely
unroll a loop with a variable trip count.
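Because loops with variable bounds cannot be completely unrolled, they not only prevent the unroll directive from being applied, they also prevent pipelining of the levels above the loop, because pipelining an upper level requires every loop below it to be unrolled.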
The solution to loops with variable bounds is to make the number of loop iterations a fixed value and to execute the loop body conditionally. The code from the variable loop bounds example can be rewritten as shown in the following code example. Here, the loop bound is explicitly set to the maximum value (N) and the loop body is executed only when the index is below the variable width:
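#include "ap_int.h"
#define N 32
typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;
dout_t loop_max_bounds(din_t A[N], dsel_t width) {
  dout_t out_accum=0;
  // int counter: a 5-bit dsel_t wraps from 31 to 0 and would never reach N
  int x;
  LOOP_X:for (x=0; x<N; x++) {
    // Fixed trip count of N; the body runs only for indices below width
    if (x<width) {
      out_accum += A[x];
    }
  }
  return out_accum;
}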
The for-loop (LOOP_X) in the example above can be unrolled. Because the loop has a fixed upper bound, Vitis HLS knows how much hardware to create. There are N (32) copies of the loop body in the RTL design. Each copy of the loop body has conditional logic associated with it and is executed depending on the value of variable width.
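The following sketch shows the rewritten loop next to the variable-bound form, so that the two can be compared in C simulation. It assumes standard integer types in place of the ap_int types and the `complete dim=1` form of the array_partition pragma. The function names sum_variable_bound and sum_max_bound are hypothetical.

```cpp
#include <cstdint>

const int N = 32;

// Assumption: plain integer stand-ins for ap_int<8>, ap_int<13> and ap_uint<5>.
typedef std::int8_t  din_t;
typedef std::int16_t dout_t;
typedef std::uint8_t dsel_t;

// Variable-bound form: the trip count depends on the run-time value of width.
dout_t sum_variable_bound(const din_t A[N], dsel_t width) {
    dout_t out_accum = 0;
    for (int x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}

// Fixed-bound form: always N iterations, each guarded by a condition.
dout_t sum_max_bound(const din_t A[N], dsel_t width) {
    // Request one port per element so the unrolled copies can read in parallel.
#pragma HLS array_partition variable=A complete dim=1
    dout_t out_accum = 0;
LOOP_X:
    for (int x = 0; x < N; x++) {
        // The trip count is the constant N, so a full unroll is well defined.
#pragma HLS unroll
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}
```

For any width from 0 to N, both functions return the same sum, which gives a simple C simulation check that the rewrite preserves the behavior of the original loop.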