When pipelining loops, the optimal balance between area and performance is typically found by pipelining the innermost loop. This also results in the fastest runtime. The following code example demonstrates the trade-offs when pipelining loops and functions.
#include "loop_pipeline.h"

dout_t loop_pipeline(din_t A[N]) {
    int i, j;
    static dout_t acc;  // static: the accumulated value persists across calls

    LOOP_I: for (i = 0; i < 20; i++) {
        LOOP_J: for (j = 0; j < 20; j++) {
            acc += A[i] * j;
        }
    }
    return acc;
}
If the innermost loop (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a single multiplier). Vitis HLS automatically flattens the loops when possible, as in this case, and effectively creates a new single loop of 20*20 iterations. Only one multiplier operation and one array access need to be scheduled; the loop iterations can then be scheduled as a single loop-body entity (20x20 loop iterations).
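As a rough illustration of this case, the sketch below places a PIPELINE pragma inside LOOP_J and, next to it, a hand-flattened single loop of 400 iterations that computes the same result. The typedefs and the value of N are assumed stand-ins for loop_pipeline.h, which is not shown here, and loop_pipeline_inner and loop_flattened are hypothetical names. The flattened form is only meant to convey what the tool effectively schedules; it is not the tool's actual output, and whether II=1 is achieved depends on the target and the accumulation dependency.

```cpp
// Assumed stand-ins for loop_pipeline.h (not shown in the source).
typedef int din_t;
typedef int dout_t;
const int N = 20;

// Innermost loop pipelined: the pragma sits inside the body of LOOP_J.
dout_t loop_pipeline_inner(din_t A[N]) {
    static dout_t acc;
    LOOP_I: for (int i = 0; i < 20; i++) {
        LOOP_J: for (int j = 0; j < 20; j++) {
            #pragma HLS PIPELINE II=1
            acc += A[i] * j;
        }
    }
    return acc;
}

// Hand-flattened equivalent (illustrative only): one loop of 20*20
// iterations with a single multiply and a single array access per iteration.
dout_t loop_flattened(din_t A[N]) {
    static dout_t acc;
    LOOP_IJ: for (int k = 0; k < 20 * 20; k++) {
        #pragma HLS PIPELINE II=1
        int i = k / 20;  // outer index recovered from the flat counter
        int j = k % 20;  // inner index recovered from the flat counter
        acc += A[i] * j;
    }
    return acc;
}
```

A standard C++ compiler ignores the HLS pragmas, so both functions can be compared in an ordinary C simulation. Because acc is static, each function keeps accumulating across repeated calls.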
If the outer loop (LOOP_I) is pipelined, the inner loop (LOOP_J) is unrolled, creating 20 copies of the loop body: 20 multipliers and 20 array accesses must now be scheduled. Each iteration of LOOP_I can then be scheduled as a single entity.
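A minimal sketch of the outer-loop case follows. The typedefs and N are again assumed stand-ins for loop_pipeline.h, and loop_pipeline_outer is a hypothetical name. The UNROLL pragma on LOOP_J is written out only to make the implied unrolling visible; pipelining LOOP_I already causes the inner loop to be unrolled, as described above.

```cpp
// Assumed stand-ins for loop_pipeline.h (not shown in the source).
typedef int din_t;
typedef int dout_t;
const int N = 20;

// Outer loop pipelined: the pragma sits inside the body of LOOP_I.
dout_t loop_pipeline_outer(din_t A[N]) {
    static dout_t acc;
    LOOP_I: for (int i = 0; i < 20; i++) {
        #pragma HLS PIPELINE II=1
        LOOP_J: for (int j = 0; j < 20; j++) {
            // Explicit here for illustration; implied by pipelining LOOP_I.
            #pragma HLS UNROLL
            acc += A[i] * j;  // 20 copies of this body after unrolling
        }
    }
    return acc;
}
```

Each iteration of LOOP_I reads a single element A[i], so the 20 unrolled multiplications in one iteration share the same array value.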
If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400 array accesses must now be scheduled. It is very unlikely that Vitis HLS will produce a design with 400 multiplications because, in most designs, data dependencies prevent maximal parallelism. For example, even if a dual-port RAM is used for A[N], the design can only access two values of A[N] in any clock cycle.
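For the function-level case, the pragma moves to the top of the function body, as in the sketch below. The typedefs, N, and the name loop_pipeline_func are assumptions for illustration, since loop_pipeline.h is not shown. No initiation interval is requested here because the achievable value is limited by how many elements of A can be read per cycle.

```cpp
// Assumed stand-ins for loop_pipeline.h (not shown in the source).
typedef int din_t;
typedef int dout_t;
const int N = 20;

// Top-level function pipelined: the pragma sits at the start of the
// function body, and both loops below it are unrolled by the tool.
dout_t loop_pipeline_func(din_t A[N]) {
    #pragma HLS PIPELINE
    static dout_t acc;
    LOOP_I: for (int i = 0; i < 20; i++) {
        LOOP_J: for (int j = 0; j < 20; j++) {
            acc += A[i] * j;  // up to 400 copies of this body to schedule
        }
    }
    return acc;
}
```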
When selecting the level of the hierarchy at which to pipeline, keep in mind that pipelining the innermost loop gives the smallest hardware with generally acceptable throughput for most applications. Pipelining the upper levels of the hierarchy unrolls all sub-loops and can create many more operations to schedule (which could impact runtime and memory capacity), but typically gives the highest-performance design in terms of throughput and latency.
To summarize the above options:
- Pipeline LOOP_J: Latency is approximately 400 cycles (20x20) and requires fewer than 100 LUTs and registers (the I/O control and FSM are always present).
- Pipeline LOOP_I: Latency is approximately 20 cycles but requires a few hundred LUTs and registers, about 20 times the logic of the first option, minus any logic optimizations that can be made.
- Pipeline function loop_pipeline: Latency is approximately 10 cycles (20 dual-port accesses) but requires thousands of LUTs and registers (about 400 times the logic of the first option, minus any optimizations that can be made).
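The arithmetic behind these three figures can be reproduced with a small host-side calculation. This is only a back-of-the-envelope sketch: Estimate, ceil_div, ESTIMATES, and print_estimates are hypothetical names, not part of Vitis HLS, and the cycle counts ignore pipeline fill, control overhead, and any optimizations the tool applies.

```cpp
#include <cstdio>

// Hypothetical summary record; not part of Vitis HLS.
struct Estimate {
    const char* pipelined_level;  // where the PIPELINE directive is applied
    int body_copies;              // copies of the loop body (multipliers)
    int approx_cycles;            // rough latency in clock cycles
};

const int OUTER = 20;      // LOOP_I trip count
const int INNER = 20;      // LOOP_J trip count
const int RAM_PORTS = 2;   // dual-port RAM assumed for A

// Ceiling division: cycles needed to read `count` values at `per_cycle` each.
constexpr int ceil_div(int count, int per_cycle) {
    return (count + per_cycle - 1) / per_cycle;
}

const Estimate ESTIMATES[] = {
    // One body copy, one iteration issued per cycle over 20*20 iterations.
    {"LOOP_J", 1, OUTER * INNER},
    // LOOP_J unrolled into 20 copies, one LOOP_I iteration per cycle.
    {"LOOP_I", INNER, OUTER},
    // Everything unrolled; bounded by reading 20 elements, two per cycle.
    {"loop_pipeline", OUTER * INNER, ceil_div(OUTER, RAM_PORTS)},
};

// Print one line per option.
void print_estimates() {
    for (const Estimate& e : ESTIMATES) {
        std::printf("%-14s copies=%3d cycles~%3d\n",
                    e.pipelined_level, e.body_copies, e.approx_cycles);
    }
}
```

Calling print_estimates() from a test harness lists the three options side by side. The body_copies column tracks the 1x, 20x, and 400x logic ratios quoted above.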