The AI Engine has a zero-overhead loop structure that incurs no branch-control overhead for comparison and branching, which reduces the inner-loop cycle count. Pipelining allows the compiler to add a preamble and postamble so that the instruction pipeline stays full during loop execution. With a pipelined loop, a new iteration can start before the previous one ends, which increases instruction-level parallelism.
The following figure shows the assembly code of a zero-overhead loop.
The following pragmas work together to direct the compiler to pipeline the loop and let it know that the loop is always executed at least three times.
for (int i=0; i<N; i+=2)
chess_prepare_for_pipelining
chess_loop_range(3,)
The chess_loop_range(<minimum>, <maximum>) pragma tells the compiler that the corresponding loop is executed at least <minimum> times and at most <maximum> times. Both values are non-negative constant expressions, and either can be omitted. When omitted, <minimum> defaults to 0 and <maximum> defaults to the maximum preset in the compiler. While <maximum> is not relevant for the pipeline implementation, <minimum> guides it.
The <minimum> number defines the fewest loop iterations that run each time the loop is entered. The software pipeline is then tuned to allow at least that many iterations to execute in parallel if possible. It also tells the compiler that checking the loop boundaries is not necessary before <minimum> iterations have executed.
The loop range pragma is not needed if the loop range is a compile-time constant. In general, the AI Engine compiler reports the theoretical number best suited for optimum pipelining of an algorithm. If the range specification is not optimal, the compiler issues a warning and suggests the optimal range. It is therefore acceptable to initially set <minimum> to one [chess_loop_range(1,)] and observe the best-suited <minimum> that the compiler reports, as in the following warning:
Warning in "matmul_vec16.cc", line 10: (loop #39)
further loop software pipelining (to 4 cycles) is feasible with `chess_prepare_for_pipelining'
but requires a minimum loop count of 3
... consider annotating the loop with `chess_loop_range(3,)' if applicable,
... or remove the current `chess_loop_range(1,)` pragma
At this point, you can update the <minimum> number to the reported optimum. Note that because the loop boundary check is skipped for the first <minimum> iterations, AI Engine kernels can deadlock if the loop actually runs fewer iterations than that. For this reason, you must ensure that the number of iterations is always at least the number specified in the chess_loop_range directive.
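The minimum-iteration requirement can be checked before the loop is entered. The following is a minimal sketch, not the vendor's recommended pattern: `trip_count` and `pair_sums` are hypothetical names, and the two `#define` lines are host-side stand-ins (an assumption made so the sketch compiles with an ordinary C++ compiler) that would be removed when building with the AI Engine compiler. Note that `for (int i=0; i<N; i+=2)` runs ceil(N/2) times, so a minimum of three iterations requires N to be at least 5.

```cpp
#include <cassert>

// Host-side stand-ins so this sketch builds with an ordinary C++ compiler.
// Assumption: delete these two lines when compiling with the AI Engine
// compiler, which handles the annotations itself.
#define chess_prepare_for_pipelining
#define chess_loop_range(...)

// Number of iterations of `for (int i = 0; i < n; i += step)` for step > 0.
constexpr int trip_count(int n, int step) {
    return n <= 0 ? 0 : (n + step - 1) / step;
}

// Hypothetical kernel body: writes the sum of each adjacent input pair.
// Preconditions: n is even, and the loop runs at least 3 times (n >= 6).
void pair_sums(const int* in, int* out, int n) {
    assert(n % 2 == 0);
    // Debug-time check that the promise made by chess_loop_range(3,) holds.
    assert(trip_count(n, 2) >= 3);

    for (int i = 0; i < n; i += 2)
        chess_prepare_for_pipelining
        chess_loop_range(3,)
    {
        out[i / 2] = in[i] + in[i + 1];
    }
}
```

If the iteration count depends on run-time data, one option is to branch to a separate, unannotated loop for the short case rather than rely on the assertion alone.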
The compiler also provides C++-compatible attribute syntax. The directives starting with the prefix chess_ can be specified as attributes in a C++ attribute list: [[chess::]]. For example:
[[chess::prepare_for_pipelining, chess::min_loop_count(3)]]
for (int i=0; i<N; i+=2)
Loop-carried dependencies impact the vectorization of code. If an inner-loop dependency cannot be removed, one strategy is to step out a level and manually unroll the outer loop so that there are (effectively) multiple copies of the inner loop running in parallel.
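As an illustration of stepping out a level, the sketch below uses plain scalar C++ rather than AI Engine intrinsics. The function names and the first-order recurrence are hypothetical, chosen only because each inner-loop iteration depends on the previous one. The second version unrolls the outer loop by two and fuses the copies, so the inner loop carries two independent dependency chains that a scheduler can overlap.

```cpp
#include <cassert>

// Baseline: within a row, iteration j needs the result of iteration j-1,
// so the inner loop is one long dependency chain.
void recur_rows(const float* x, float* y, int rows, int cols, float a) {
    for (int r = 0; r < rows; ++r) {
        float prev = 0.0f;
        for (int j = 0; j < cols; ++j) {
            prev = a * prev + x[r * cols + j];
            y[r * cols + j] = prev;
        }
    }
}

// Outer loop unrolled by two and fused into one inner loop.
// Rows r and r+1 are independent, so their recurrences can be interleaved.
// Precondition (assumed for brevity): rows is even.
void recur_rows_x2(const float* x, float* y, int rows, int cols, float a) {
    assert(rows % 2 == 0);
    for (int r = 0; r < rows; r += 2) {
        const float* x0 = x + r * cols;
        const float* x1 = x0 + cols;
        float* y0 = y + r * cols;
        float* y1 = y0 + cols;
        float p0 = 0.0f;
        float p1 = 0.0f;
        for (int j = 0; j < cols; ++j) {
            p0 = a * p0 + x0[j];  // chain for row r
            p1 = a * p1 + x1[j];  // chain for row r+1, independent of p0
            y0[j] = p0;
            y1[j] = p1;
        }
    }
}
```

Each row is still computed in its original order, so the arithmetic per row is unchanged. Only the interleaving of independent work differs.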
Avoid issuing a run of sequential load operations to fill a vector register completely before it is used. It is best to interleave loads with vector operation functions, so that a MAC and a load can be done in the same cycle.
In certain use cases, loop rotation, which rotates the instructions inside the loop, can be beneficial. Instead of loading data into a vector at the start of a loop, consider loading a block of data for the first iteration before the loop, and then loading the data for the next iteration near the end of the loop. This adds instructions but shortens the dependency length of the loop, which helps to achieve an ideal loop with a potentially lower loop range.
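Loop rotation can be sketched in scalar C++ as follows. This is an illustrative example with hypothetical function names, not AI Engine vector code. In the rotated version the first load is hoisted above the loop and each iteration loads the operands for the next one after it has consumed the current ones. The last multiply-accumulate is peeled out of the loop so that no element past the end of the arrays is read.

```cpp
#include <cassert>

// Straightforward form: load at the top of the body, then use the values.
int dot_plain(const int* a, const int* b, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        int va = a[i];
        int vb = b[i];
        acc += va * vb;
    }
    return acc;
}

// Rotated form: the load for iteration i+1 is issued near the end of
// iteration i, so the multiply-accumulate never waits on a load from
// its own iteration.
// Precondition: n >= 1.
int dot_rotated(const int* a, const int* b, int n) {
    assert(n >= 1);
    int va = a[0];  // load for the first iteration, before the loop
    int vb = b[0];
    int acc = 0;
    for (int i = 0; i < n - 1; ++i) {
        acc += va * vb;  // uses data loaded earlier
        va = a[i + 1];   // load for the next iteration
        vb = b[i + 1];
    }
    acc += va * vb;      // peeled final iteration
    return acc;
}
```

The rotated version has extra instructions outside the loop, which matches the trade-off described above: more code in exchange for a shorter dependency chain inside the loop body.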