Description
Allows inner nested loops to be collapsed (flattened) into a single loop so that pipelining can be applied on all iterations of the loops with the goal of achieving better latency.
Only innermost loops (after any unrolling of the loops inside them) can be pipelined. Outer loops can only be dataflow or sequential. When a loop above a pipelined loop is sequential, its iterations are executed in sequence and, for each iteration, the inner loop is fully executed once. In the RTL implementation, this requires one clock cycle to move from the outer loop to the inner loop, one clock cycle to move back from the inner loop to the outer loop, plus, in between, the whole latency to complete all iterations of the inner loop. In contrast, flattening nested loops allows them to be optimized and pipelined as a single loop, so that iterations of the outer loop and iterations of different executions of the inner loop can overlap (be pipelined). In general, flattening improves performance, but:
- Flattening cannot always be achieved; a particular coding style needs to be followed.
- In some cases, timing and even II can be degraded, depending on the operations and/or dependencies in the outer loop.
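The cycle accounting described above can be sketched as a rough model. This is only an illustration, not a tool report: the function names are hypothetical, and the model assumes the inner loop is pipelined with a fixed initiation interval `ii` and a fixed pipeline depth `depth`.

```c
/* Rough cycle-count model (an assumption for illustration, not tool output).
 * ii    = initiation interval of the pipelined loop
 * depth = cycles needed for one iteration to travel through the pipeline */

/* Latency of a pipelined loop that runs `trip` iterations. */
static long pipelined_latency(long trip, long ii, long depth) {
    return (trip > 0) ? (trip - 1) * ii + depth : 0;
}

/* Sequential outer loop: each of the n outer iterations pays one cycle to
 * enter the inner loop, the full inner-loop latency, and one cycle to leave. */
long nested_latency(long n, long m, long ii, long depth) {
    return n * (1 + pipelined_latency(m, ii, depth) + 1);
}

/* Flattened nest: a single pipelined loop with n * m iterations. */
long flattened_latency(long n, long m, long ii, long depth) {
    return pipelined_latency(n * m, ii, depth);
}
```

Under this model, with N = 4, M = 8, II = 1 and a depth of 3, the nest takes 48 cycles and the flattened loop takes 34. Actual results depend on the design and on whether flattening degrades II or timing.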
Apply the LOOP_FLATTEN pragma to the loop body of the innermost loop in the loop hierarchy. Only perfect and almost-perfect loop nests (after any preliminary function inlining or loop unrolling) can be flattened in this manner:
- Perfect loop nests
  - The body of each non-innermost loop contains one and only one subloop, and no other instructions.
- Almost-perfect loop nests
  - The body of each non-innermost loop contains one and only one subloop, and no other control flow.
  - The body of each non-innermost loop must not contain a function call that contains a loop.
For almost-perfect loops, the compiler automatically pushes any instructions that exist between the two loops into the innermost loop so that the loops are perfectly nested.
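One way to picture this transformation is to write it by hand. The sketch below is only an illustration: the function names and array sizes are hypothetical, and the guarded form is an assumed source-level equivalent, not the compiler's actual output. A standard C compiler ignores the HLS pragma.

```c
enum { N = 3, M = 4 };  /* hypothetical sizes, chosen for illustration */

/* Almost-perfect nest as written: `s = 0` sits between the two loops. */
void prefix_rows(int a[N][M], int b[N][M]) {
    for (int i = 0; i < N; i++) {
        int s = 0;
        for (int j = 0; j < M; j++) {
#pragma HLS loop_flatten
            s += a[i][j];
            b[i][j] = s;
        }
    }
}

/* Hand-written picture of the same nest once the instruction has been
 * pushed into the innermost loop. The guard keeps it running once per
 * outer iteration, so the nest is now perfect. */
void prefix_rows_pushed(int a[N][M], int b[N][M]) {
    int s = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            if (j == 0) s = 0;
            s += a[i][j];
            b[i][j] = s;
        }
    }
}
```

Both functions write the same running row sums into `b`.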
In addition, the following flattening requirements must be met:
- Each loop should be a for-loop, not a while-loop, with a single exiting block and no break statements.
- The trip count of each loop should be computable by the compiler before entering the loops to be flattened (it does not need to be a numerical constant). A typical coding style is:
  - A loop with a loop counter incremented by a numerical constant.
  - A lower bound and upper bound for the loop counter that do not depend on the loops to be flattened (they should be loop-invariant).
An example of a loop nest that cannot be flattened is one in which the trip count of the inner loop depends on the outer loop counter.
Imperfect loop nests (for example, when a loop contains more than one subloop or additional control flow) cannot be flattened by the compiler. In this case, flattening needs to be done by hand by restructuring the code, pushing instructions into the innermost loop, or unrolling inner loops to create a perfect loop nest above.
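As a minimal sketch of such a restructuring, the following merges two subloops into one so that the outer loop contains a single subloop. The function names and sizes are hypothetical, and the merge is only valid here because both subloops have the same bounds and each element of `out` depends only on the element of `tmp` at the same index. A standard C compiler ignores the HLS pragma.

```c
enum { N = 2, M = 3 };  /* hypothetical sizes, chosen for illustration */

/* Imperfect nest: the outer loop contains two subloops. */
void scale_then_offset(int in[N][M], int tmp[N][M], int out[N][M]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            tmp[i][j] = 2 * in[i][j];
        }
        for (int j = 0; j < M; j++) {
            out[i][j] = tmp[i][j] + 1;
        }
    }
}

/* Restructured by hand: the two subloops are merged, leaving a perfect
 * nest in which the pragma can be placed. */
void scale_then_offset_merged(int in[N][M], int tmp[N][M], int out[N][M]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
#pragma HLS loop_flatten
            tmp[i][j] = 2 * in[i][j];
            out[i][j] = tmp[i][j] + 1;
        }
    }
}
```

Whether a merge like this is legal depends on the data dependencies between the original subloops, so it has to be checked case by case.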
Syntax
Place the pragma in the C source, inside the body of the (typically innermost) loop that is to be flattened with the (perfect or almost-perfect) loops above it, provided that the flattening requirements are fulfilled.
#pragma HLS loop_flatten
Options:
- off: Optional keyword. Prevents the loop that contains loop_flatten off from being flattened with its subloops (if any). If loop_flatten off is placed in the innermost loop, no flattening (not even auto-flattening) occurs.
Example 1
Place the pragma in the body of loop_1 in function foo to flatten loop_1 into a single loop with the (perfect or almost-perfect) loops above it in the loop hierarchy (here loop_0), provided that the flattening requirements are fulfilled.
void foo (N, M, ...) {
    int i, j;
    loop_0: for (i=0; i<N; i++) {
        loop_1: for (j=0; j<M; j++) {
            #pragma HLS loop_flatten
            ...
        }
    }
}
Example 2
Prevents loop_1 from being flattened with loop_0. Only loop_1 and loop_2 are flattened together.
void foo (N, M, P, ...) {
    int i, j, k;
    loop_0: for (i=0; i<N; i++) {
        #pragma HLS loop_flatten off
        loop_1: for (j=0; j<M; j++) {
            loop_2: for (k=0; k<P; k++) {
                #pragma HLS loop_flatten
                ...
            }
        }
    }
}
Example 3
With more than two nested loops above a loop_flatten pragma, the compiler can decide not to flatten all loops if a possible degradation is anticipated, or if it needs to know (but cannot prove) that the inner loop iterates at least once. In this case, flattening can be forced with an additional loop_flatten pragma in the loop to be flattened with its surrounding loop, here in loop_1 to force flattening with loop_0.
void foo (N, M, P, ...) {
    int i, j, k;
    loop_0: for (i=0; i<N; i++) {
        ... // some instruction with side effect
        loop_1: for (j=0; j<M; j++) {
            #pragma HLS loop_flatten
            loop_2: for (k=0; k<P; k++) {
                #pragma HLS loop_flatten
                ...
            }
        }
    }
}
Remarks:
- With a loop_flatten pragma, if an instruction with side effects needs to be pushed into the loop (for almost-perfect loops), the compiler assumes that the loop containing the pragma iterates at least once.
- When flattening, any dependence pragma (false or with distance) in the inner loop is interpreted as applying to the single loop obtained after flattening.
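The at-least-once assumption in the first remark matters because a pushed instruction only runs when the inner loop body runs. The sketch below illustrates this with hypothetical function names and a hand-written guarded form; it is an assumed picture of the effect, not the compiler's actual output.

```c
enum { N = 3 };  /* hypothetical outer trip count */

/* As written: marks[i] is set on every outer iteration, even if m == 0. */
void mark_rows(int m, int marks[N], int sums[N]) {
    for (int i = 0; i < N; i++) {
        marks[i] = 1;                 /* side effect between the loops */
        for (int j = 0; j < m; j++) {
            sums[i] += j;
        }
    }
}

/* After pushing the write into the inner loop, it runs only when the
 * inner loop body runs, so both versions agree only when m >= 1. */
void mark_rows_pushed(int m, int marks[N], int sums[N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < m; j++) {
            if (j == 0) marks[i] = 1;
            sums[i] += j;
        }
    }
}
```

With `m >= 1` the two functions produce the same `marks` and `sums`. With `m == 0` the pushed version never writes `marks`, which is why the inner trip count must be at least one for the transformation to preserve behavior.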
Example 4
The following example shows almost-perfect loops that can be flattened.
void foo (...) {
    int i, j;
    loop_0: for (i=0; i<N; i++) {
        int s = 0;
        loop_1: for (j=0; j<M; j++) {
            #pragma HLS loop_flatten
            s += a[i][j];
            b[i][j] = s;
        }
    }
}
Example 5
The following example shows loops that are not flattened, because the lower bound of loop_1 depends on the loop counter i of loop_0.
void foo (...) {
    int i, j;
    loop_0: for (i=0; i<N; i++) {
        loop_1: for (j=i; j<M; j++) {
            #pragma HLS loop_flatten
            ...
        }
    }
}
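A possible manual workaround for a nest like this is to give the inner loop loop-invariant bounds and move the dependence on i into a guard in the loop body. The sketch below uses hypothetical function names and sizes and a simple summation as the loop body. It only shows that the rewrite preserves the result; whether the tool then flattens the nest, and whether the extra idle iterations are worthwhile, is an assumption to verify in the synthesis report. A standard C compiler ignores the HLS pragma.

```c
enum { N = 3, M = 4 };  /* hypothetical sizes, chosen for illustration */

/* Triangular nest: the inner loop starts at j = i, so its trip count
 * depends on the outer loop counter. */
int sum_triangular(int a[N][M]) {
    int total = 0;
    for (int i = 0; i < N; i++) {
        for (int j = i; j < M; j++) {
            total += a[i][j];
        }
    }
    return total;
}

/* Rectangular rewrite: both bounds are loop-invariant, and the guard
 * skips the iterations that the triangular nest never executed. */
int sum_rectangular(int a[N][M]) {
    int total = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
#pragma HLS loop_flatten
            if (j >= i) total += a[i][j];
        }
    }
    return total;
}
```

The rewrite always runs N * M iterations, including those that do no work, so it trades some iterations for a regular loop structure.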