This section dives into software pipelining of loops. This is an important concept that enables the AI Engine to concurrently execute different parts of a program. For example, the following figure illustrates a loop that requires nine cycles per iteration, scheduled in ways that range from sequential execution to fully overlapped pipelining.
Counting the cycles in each of these examples shows that sequential execution requires 27 cycles to execute the three loop iterations, the partially overlapped pipeline requires 13 cycles, and the fully pipelined loop requires only 11 cycles. From a performance perspective, a fully overlapped pipeline is therefore desirable. However, this is not always possible, because resource constraints and inter-iteration loop dependencies can prevent a full overlap (see the following figure).
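The cycle counts above can be reproduced with a small calculation. The following is a minimal sketch, not part of the tool flow: the function name total_cycles is hypothetical, and the start-to-start distance of two cycles for the partially overlapped case is inferred from the 13-cycle total, because the figure itself is not reproduced here.

```cpp
// Total cycles needed to run `iterations` loop iterations when one iteration
// takes `latency` cycles and a new iteration starts every `interval` cycles.
// Setting interval == latency models sequential execution (no overlap).
constexpr int total_cycles(int latency, int iterations, int interval) {
    return iterations <= 0 ? 0 : latency + (iterations - 1) * interval;
}

// Nine-cycle iteration, three iterations, as in the text.
static_assert(total_cycles(9, 3, 9) == 27, "sequential execution");
static_assert(total_cycles(9, 3, 2) == 13, "partial overlap (interval inferred)");
static_assert(total_cycles(9, 3, 1) == 11, "full overlap");
```

The last iteration always needs the full nine cycles, so the benefit of overlapping grows with the number of iterations.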
In this example, the program performs load A (2 x 256-bit) in cycle 2 and load B (2 x 256-bit) in cycle 3, and in cycles 6 and 7 it executes operations on loop variable A. The remaining instructions of this iteration do not matter for the loop performance analysis.
Cycles 2 and 3 of this loop iteration execute 4 x 256-bit load operations. The four required loads take two cycles because an AI Engine can execute only two loads per cycle. This is called a resource constraint. If the loop containing this iteration is to be pipelined, this constraint limits the overlap: successive iterations must start at least two cycles apart. Similarly, the code dependencies between iterations shown in cycles 6 and 7 can prevent additional overlap. In this case, the next iteration of the loop requires the value of A to be updated before it can be used, which also limits the overlap.
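The two limits described above can be combined into a lower bound on how closely iterations can follow each other. The sketch below is illustrative only: the helper names are hypothetical, and the dependency bound of two cycles assumes that the dependent operations on A occupy exactly cycles 6 and 7.

```cpp
// Ceiling division for positive integers.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Resource bound: cycles needed to issue `loads` load operations when the
// hardware can issue `loads_per_cycle` of them in each cycle.
constexpr int resource_bound(int loads, int loads_per_cycle) {
    return ceil_div(loads, loads_per_cycle);
}

// Successive iterations cannot start closer together than the larger of the
// resource bound and the inter-iteration dependency bound.
constexpr int min_interval(int resource_min, int dependency_min) {
    return resource_min > dependency_min ? resource_min : dependency_min;
}

// Four 256-bit loads at two loads per cycle need two cycles.
static_assert(resource_bound(4, 2) == 2, "resource constraint from the text");
// Assumed two-cycle dependency chain on A (cycles 6 and 7).
static_assert(min_interval(resource_bound(4, 2), 2) == 2, "combined lower bound");
```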
The aiecompiler reports on each loop in the following form. The -v option is needed to generate the verbose report.
HW do-loop #397 in "testbench.cc", line 132: (loop #16) :
Critical cycle of length 2 : b67 -> b68 -> b67
Minimum length due to resources: 2
Scheduling HW do-loop #397
(algo 1a) -> # cycles: 9
(modulo) -> # cycles: 2 ok (required budget ratio: 1)
(resume algo) -> after folding: 2 (folded over 4 iterations)
-> HW do-loop #397 in "testbench.cc", line 132: (loop #16) : 2 cycles
NOTICE: loop #397 contains folded negative edges
NOTICE: postamble created
Removing chess_separator blocks (all)
In the aiecompiler report shown previously, the Critical cycle of length line provides feedback on code dependencies, while the Minimum length due to resources line indicates the minimum overlap requirement due to resource constraints. The algo 1a line states the total number of cycles for a single iteration. Given these numbers (a nine-cycle iteration with a new iteration starting every two cycles), a maximum of five iterations are active at a time, creating the pipeline.
The aiecompiler reports these five overlapping iterations (the current iteration plus four folded iterations) in the resume algo line. In addition, it states the initiation interval (II), the number of cycles a single iteration has to execute before the following iteration is started, which is two in this example.
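The relationship between the numbers in the report can be checked with a short calculation. This is a sketch under the assumption that the number of active iterations is the single-iteration cycle count divided by the II, rounded up; the function names are hypothetical and are not part of any AI Engine API.

```cpp
// Number of iterations that are in flight at the same time when one iteration
// takes `latency` cycles and a new one starts every `ii` cycles (rounded up).
constexpr int iterations_in_flight(int latency, int ii) {
    return (latency + ii - 1) / ii;
}

// Iterations folded on top of the current one.
constexpr int folded_iterations(int latency, int ii) {
    return iterations_in_flight(latency, ii) - 1;
}

// Values from the report: algo 1a gives 9 cycles, the modulo schedule gives II = 2.
static_assert(iterations_in_flight(9, 2) == 5, "five overlapping iterations");
static_assert(folded_iterations(9, 2) == 4, "folded over 4 iterations");
```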
In general, it is sufficient to provide the directive chess_prepare_for_pipelining to instruct the compiler to attempt software pipelining. When the number of loop iterations is a compile-time constant, the chess compiler creates the optimum software pipeline.
In the case of a dynamic loop range (defined by a variable start/end), the compiler requires additional information to create an effective pipelined loop structure. This is provided through the directive chess_loop_range(<minimum>, <maximum>).
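As an illustration, the two directives could be placed between the loop header and the loop body as sketched below. The kernel function scale and the range of 8 to 1024 iterations are hypothetical. The fallback macros rely on the assumption that the chess compiler defines __chess__; they let the same source build as plain C++ on a host compiler, where the directives expand to nothing.

```cpp
// Assumption: the chess compiler defines __chess__. On any other compiler the
// directives are defined as empty macros so the loop still builds.
#ifndef __chess__
#define chess_prepare_for_pipelining
#define chess_loop_range(...)
#endif

// Hypothetical kernel body: multiplies n samples by a gain.
// The caller is assumed to guarantee 8 <= n <= 1024.
void scale(const int* in, int* out, int n, int gain) {
    for (int i = 0; i < n; ++i)
        chess_prepare_for_pipelining   // ask the compiler to attempt software pipelining
        chess_loop_range(8, 1024)      // minimum and maximum iteration count
    {
        out[i] = in[i] * gain;
    }
}
```

The loop range is a promise made to the compiler, so the calling code has to keep the iteration count within the stated bounds.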
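When the cycle count of a single iteration exceeds the folding limit, the compiler does not fold the loop and reports:
(algo 1a) -> # cycles: 167 (exceeds -k 64) -> no folding: 167
-> HW do-loop #511 in "xxxx", line 794: (loop #8): 167 cycles
The limit can be raised with the following aiecompiler option:
--Xchess="main:backend.mist2.maxfoldk=200"
The -v option is needed for the aiecompiler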
in the command line.
A modulo scheduling report can be generated for modulo-scheduled loops by specifying the option -Xchess=main:backend.mist2.xargs=-ggraph for the aiecompiler. The modulo scheduling report is available for each software-pipelined loop with the name *_modulo.rpt in Work/aie/core_ID/Release/chesswork/<mangled_function_name>/*.rpt, where * is the block name. The modulo scheduling report also contains information about register live ranges for register files, which can help find inefficiencies in register assignment that can be improved by using chess_storage.
After compilation and linking are completed, you can open the compile log for an individual kernel in the Vitis IDE. For more information, see the AI Engine Tools and Flows User Guide (UG1076).