Task parallelism lets you exploit dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, complete execution units (tasks) operate in parallel, taking advantage of the extra buffering introduced between the tasks.
See the following example:
void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for (int i = 0; i < 8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}
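The bodies of processA and processB are not shown here. Purely for illustration, a minimal sketch of two such block-processing functions could look as follows; the arithmetic inside them is an assumption, chosen only so the example is self-contained and compiles:

#include "ap_int.h"

// Hypothetical producer: reads one 128-sample block and writes it to tmp.
void processA(ap_uint<16> *in, ap_uint<16> *tmp) {
  for (int j = 0; j < 128; j++) {
    tmp[j] = in[j] + 1; // placeholder computation
  }
}

// Hypothetical consumer: reads tmp and writes one 128-sample block out.
void processB(ap_uint<16> *tmp, ap_uint<16> *out) {
  for (int j = 0; j < 128; j++) {
    out[j] = tmp[j] * 2; // placeholder computation
  }
}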
When this code is executed, the functions processA and processB are executed sequentially, eight times in a row. Given the combined latency of 278 cycles for processA and processB per loop iteration, the total latency can be estimated as:

(8 iterations * 278 cycles) + 1 = 2225 cycles

The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, task parallelism is enabled by adding the DATAFLOW pragma inside the for-loop:
#pragma HLS DATAFLOW
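Applied to the example above, the pragma goes inside the loop body, turning the two function calls of each iteration into a dataflow region:

void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for (int i = 0; i < 8; i++) {
#pragma HLS DATAFLOW
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}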
For OpenCL API code, add the attribute before the for-loop:
__attribute__ ((xcl_dataflow))
Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.
As illustrated by the estimates in the HLS report, applying the transformation considerably improves the overall performance, effectively using a double (ping-pong) buffer scheme between the tasks.
The overall latency of the design drops significantly in this case: because tasks from different loop iterations execute concurrently, the steady-state cost per iteration is halved from 278 to 139 cycles. Given the 139 cycles per processing function and the full overlap across the 8 iterations, the total latency can be estimated as:

(1x only processA + 7x both processes + 1x only processB) * 139 cycles = 1251 cycles
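To see what the tool's double buffering amounts to, the following hand-written sketch makes the ping-pong scheme explicit; the name run_pingpong and the structure are assumptions made for illustration, not the code Vitis HLS generates. With two tmp buffers, the processB call of one iteration and the processA call of the next have no data dependence, which is what allows them to execute concurrently in hardware:

void run_pingpong(ap_uint<16> in[1024],
                  ap_uint<16> out[1024]) {
  ap_uint<16> tmp0[128], tmp1[128]; // the two halves of the ping-pong buffer
  processA(&(in[0]), tmp0);         // prologue: only processA runs
  for (int i = 1; i < 8; i++) {
    if (i % 2 == 1) {
      processB(tmp0, &(out[(i-1)*128])); // drain the buffer filled last time...
      processA(&(in[i*128]), tmp1);      // ...while filling the other one
    } else {
      processB(tmp1, &(out[(i-1)*128]));
      processA(&(in[i*128]), tmp0);
    }
  }
  processB(tmp1, &(out[7*128]));    // epilogue: only processB runs
}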
Using task parallelism is a powerful method for improving performance in the implementation. However, the effectiveness of applying the DATAFLOW pragma to an arbitrary piece of code can vary significantly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. The Vitis core development kit provides the Detailed Kernel Trace, which illustrates this concurrent execution.
In the Detailed Kernel Trace, the tool displays the start of the dataflow loop. The trace illustrates how processA starts up right away at the beginning of the loop, while processB waits until processA completes before starting its own first iteration. However, while processB completes the first iteration of the loop, processA already begins operating on the second iteration, and so on.
A more abstract representation of this information is presented in the Application Timeline for the host and device activity.