Control-driven TLP is useful for modeling parallelism while relying on the sequential semantics of C++ rather than on continuously running threads. Examples include functions that can be executed in a concurrent, pipelined fashion, possibly within loops, or whose arguments are not channels but C++ scalar and array variables, referring to either on-chip or off-chip memories. For this kind of model, Vitis HLS introduces parallelism where possible while preserving the behavior of the original sequential C++ execution. The control-driven TLP (or dataflow) model enables the following:
- A subsequent function can start before the previous function finishes
- A function can be restarted before it finishes
- Two or more sequential functions can be started simultaneously
When the dataflow model is used, Vitis HLS preserves the sequential semantics of the C++ code by automatically inserting synchronization and communication mechanisms between the tasks.
The dataflow model takes this series of sequential functions and creates a task-level pipeline architecture of concurrent processes. The tool does this by inferring the parallel tasks and channels. The designer specifies the region to model in the dataflow style (for example, a function body or a loop body) by applying the DATAFLOW pragma or directive, as shown below. The tool scans the loop/function body, extracts the parallel tasks as parallel processes, and establishes communication channels between these processes. The designer can additionally guide the tool to select the type of channels, for example, FIFO (hls::stream or #pragma HLS STREAM), PIPO, or hls::stream_of_blocks. The dataflow model is a powerful method for improving design throughput and latency.
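As a small, hedged sketch of channel selection (the task names, data_t, and N below are placeholders, not part of the example that follows): an array connecting two processes inside a DATAFLOW region is implemented as a PIPO buffer by default, and the STREAM pragma can request a FIFO instead.

typedef int data_t;   // placeholder element type
#define N 100         // placeholder vector length

// Hypothetical producer and consumer tasks.
static void stage1(data_t in[N], data_t tmp[N]) {
    for (int i = 0; i < N; i++)
        tmp[i] = in[i] + 1;
}

static void stage2(data_t tmp[N], data_t out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = tmp[i] * 2;
}

void top(data_t in[N], data_t out[N])
{
#pragma HLS dataflow
    data_t tmp[N];
    // Request a FIFO channel for tmp instead of the default PIPO buffer.
#pragma HLS stream variable=tmp depth=2
    stage1(in, tmp);
    stage2(tmp, out);
}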
To understand how Vitis HLS transforms your C++ code with the dataflow model, refer to the simple_fifos example shown below. The example applies the dataflow model to the top-level diamond function using the DATAFLOW pragma, as shown.
#include "diamond.h"
void diamond(data_t vecIn[N], data_t vecOut[N])
{
data_t c1[N], c2[N], c3[N], c4[N];
#pragma HLS dataflow
funcA(vecIn, c1, c2);
funcB(c1, c3);
funcC(c2, c4);
funcD(c3, c4, vecOut);
}
In the above example, there are four functions: funcA, funcB, funcC, and funcD. funcB and funcC do not have any data dependencies between them and can therefore be executed in parallel. funcA reads from the non-local memory (vecIn) and needs to be executed first. Similarly, funcD writes to the non-local memory (vecOut) and therefore has to be executed last.
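For illustration only, the following is one possible set of sub-function bodies consistent with the calls in diamond(); the actual simple_fifos example may differ, but this sketch makes the diamond-shaped dependency structure explicit.

#include "diamond.h"   // defines data_t and N

// funcA fans out: both funcB and funcC depend only on its outputs.
void funcA(data_t in[N], data_t out1[N], data_t out2[N]) {
    for (int i = 0; i < N; i++) {
        data_t t = in[i] * 3;
        out1[i] = t;   // feeds funcB
        out2[i] = t;   // feeds funcC
    }
}

// funcB and funcC are independent of each other and can run in parallel.
void funcB(data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 25;
}

void funcC(data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] * 2;
}

// funcD joins both branches, so it must run last.
void funcD(data_t in1[N], data_t in2[N], data_t out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in1[i] + in2[i];
}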
The following waveform shows the execution profile of this design without the dataflow model. There are three calls to the diamond function from the test bench. funcA, funcB, funcC, and funcD are executed in sequential order. Each call to diamond therefore takes 475 cycles in total, as shown in the following figure.
In the following figure, when the dataflow model is applied and the designer selects FIFOs for the channels, all of the functions are started immediately by the controller and stall waiting on input. As soon as input arrives, it is processed and sent out. Due to this type of overlap, each call to diamond now takes only 275 cycles in total, as shown below. Refer to Combining the Three Paradigms for a more detailed discussion of the types of parallelism that can be achieved for this example.
This type of parallelism cannot be achieved without incurring an increase in hardware utilization. When a particular region, such as a function body or a loop body, is identified as a region to apply the dataflow model, Vitis HLS analyzes the function or loop body and creates individual channels from C++ variables (such as scalars, arrays, or user-defined channels such as hls::streams or hls::stream_of_blocks) that model the flow of data in the dataflow region. These channels can be simple FIFOs for scalar variables, or ping-pong (PIPO) buffers for non-scalar variables like arrays (or a stream of blocks when you need a combination of FIFO and PIPO behavior with explicit locking of the blocks).
Each of these channels can contain additional signals to indicate when the channel is full or empty. By having individual FIFOs and/or PIPO buffers, Vitis HLS frees each task to execute independently and the throughput is only limited by the availability of the input and output buffers. This allows for better overlapping of task execution than a normal pipelined implementation, but does so at the cost of additional FIFO or block RAM registers for the ping-pong buffer.
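As a hedged sketch of the stream-of-blocks option (the task names, data_t, and N are placeholders; the interface assumed here is hls::stream_of_blocks with write_lock/read_lock from hls_streamofblocks.h), two processes can exchange whole blocks while each holds exclusive access to one block at a time.

#include "hls_streamofblocks.h"

typedef int data_t;           // placeholder element type
#define N 64                  // placeholder block size
typedef data_t block_t[N];    // the block exchanged between the two processes

// The producer acquires a write lock on a free block, fills it, and releases
// it when the lock goes out of scope.
void produce(data_t in[N], hls::stream_of_blocks<block_t> &s) {
    hls::write_lock<block_t> b(s);
    for (int i = 0; i < N; i++)
        b[i] = in[i] + 1;
}

// The consumer acquires a read lock on a full block and drains it.
void consume(hls::stream_of_blocks<block_t> &s, data_t out[N]) {
    hls::read_lock<block_t> b(s);
    for (int i = 0; i < N; i++)
        out[i] = b[i];
}

void top(data_t in[N], data_t out[N]) {
#pragma HLS dataflow
    hls::stream_of_blocks<block_t> s;
    produce(in, s);
    consume(s, out);
}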
The dataflow model is not limited to a chain of processes but can be used on any directed acyclic graph (DAG) structure, or on a cyclic structure when using hls::streams. It can produce two different forms of overlapping: within an iteration if processes are connected with FIFOs, and across different iterations when connected with PIPOs and FIFOs. This potentially improves performance over a statically pipelined solution. It replaces the strict, centrally-controlled pipeline stall philosophy with a distributed handshaking architecture using FIFOs and/or PIPOs. The replacement of the centralized control structure with a distributed one also benefits the fanout of control signals, for example register enables, which is distributed among the control structures of individual processes. Refer to the Task Level Parallelism/Control-Driven examples on GitHub for more examples of these concepts.
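As a minimal sketch of the first form of overlap (assumed task names; hls::stream from hls_stream.h), two processes connected by a FIFO overlap within a single iteration because the consumer can start reading elements while the producer is still writing.

#include "hls_stream.h"

typedef int data_t;   // placeholder element type
#define N 128         // placeholder trip count

void produce(data_t in[N], hls::stream<data_t> &s) {
    for (int i = 0; i < N; i++)
        s.write(in[i] + 1);
}

void consume(hls::stream<data_t> &s, data_t out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = s.read();
}

void top(data_t in[N], data_t out[N]) {
#pragma HLS dataflow
    hls::stream<data_t> s("s");
    produce(in, s);
    consume(s, out);
}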