Code written for high-level synthesis (HLS) frequently implements repetitive algorithms that process blocks of data, for example in signal or image processing. As a result, the C/C++ source code typically contains several loops or several levels of nested loops.
Loops are one of the best places to start when optimizing performance. Each iteration of a loop takes at least one clock cycle to execute in hardware. From a hardware perspective, each pass through the loop body implicitly waits for a clock. The next iteration of a loop starts only when the previous iteration has finished. To improve performance, loops can generally be either pipelined or unrolled to take advantage of the highly distributed and parallel FPGA architecture, as explained in the following sections.
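
As a rough illustration of the two options, the sketch below shows one pipelined loop and one partially unrolled loop. It assumes AMD/Xilinx Vitis HLS-style `#pragma HLS` directives, which the text above does not name. The function names, the block size `N`, and the unroll factor are hypothetical and chosen only for the example. A standard C++ compiler ignores the pragmas, so the code behaves identically in software; only the synthesized hardware schedule is meant to differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical block size for this sketch (a multiple of the unroll factor).
constexpr std::size_t N = 64;

// Pipelining: asks the tool to start a new iteration every clock cycle
// (initiation interval II=1) instead of waiting for the previous iteration
// to finish. Assumes Vitis HLS-style pragma syntax.
int32_t sum_pipelined(const int32_t in[N]) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += in[i];
    }
    return acc;
}

// Unrolling: asks the tool to create 4 copies of the loop body so that
// 4 independent multiplications can be scheduled in parallel.
void scale_unrolled(const int32_t in[N], int32_t out[N], int32_t gain) {
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4
        out[i] = in[i] * gain;
    }
}
```

Whether the requested initiation interval or parallelism is actually reached depends on the tool, the target device, and how the arrays are mapped to memory, so the synthesis report should be checked rather than assumed.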