- A subsequent function can start before the previous finishes
- A function can be restarted before it finishes
- Two or more sequential functions can be started simultaneously
While using the dataflow model, Vitis HLS implements the sequential semantics of the C++ code by automatically inserting synchronization and communication mechanisms between the tasks.
The dataflow model takes this series of sequential functions and creates a
task-level pipeline architecture of concurrent processes. The tool does this by
inferring the parallel tasks and channels. The designer specifies the region to model in
the dataflow style (i.e., a function body or a loop body) by specifying the DATAFLOW pragma or directive as shown below. The
tool scans the loop/function body, extracts the parallel tasks as parallel processes,
and establishes communication channels between these processes. The designer can
additionally guide the tool to select the type of channels: FIFO (hls::stream
or #pragma HLS STREAM), PIPO, or hls::stream_of_blocks.
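As a minimal sketch of guiding the channel type, the following two-task region asks for a FIFO on a local array that would otherwise become a PIPO. The names produce, consume, two_stage, and LEN, and the depth of 16, are hypothetical and chosen only for illustration. The STREAM pragma options shown (variable, type, depth) follow the form used in recent Vitis HLS releases and should be checked against the pragma reference for your tool version. A standard C++ compiler ignores the HLS pragmas, so the code also runs as ordinary software.

```cpp
#include <cassert>

const int LEN = 64;  // hypothetical array length

// Producer task: writes tmp in sequential order, as a FIFO channel requires.
void produce(int in[LEN], int tmp[LEN]) {
    for (int i = 0; i < LEN; i++) {
        tmp[i] = in[i] + 1;
    }
}

// Consumer task: reads tmp in the same sequential order.
void consume(int tmp[LEN], int out[LEN]) {
    for (int i = 0; i < LEN; i++) {
        out[i] = tmp[i] * 2;
    }
}

void two_stage(int in[LEN], int out[LEN]) {
#pragma HLS dataflow
    int tmp[LEN];
    // Assumed pragma form: request a FIFO of depth 16 for tmp instead of
    // the PIPO that an array channel would get by default.
#pragma HLS stream variable=tmp type=fifo depth=16
    produce(in, tmp);
    consume(tmp, out);
}

// Software check of the sequential semantics: out[i] == (in[i] + 1) * 2.
void check_two_stage() {
    int in[LEN], out[LEN];
    for (int i = 0; i < LEN; i++) in[i] = i;
    two_stage(in, out);
    for (int i = 0; i < LEN; i++) assert(out[i] == (in[i] + 1) * 2);
}
```

A FIFO is only a valid choice here because both tasks access tmp strictly in order and exactly once per element. If either task accessed the array out of order, a PIPO would be the appropriate channel.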
The dataflow model is a powerful method for improving design throughput and latency.
In order to understand how Vitis HLS
transforms your C++ code into the dataflow model, refer to the simple_fifos example shown below. The
example applies the dataflow model to the top-level diamond
function using the DATAFLOW pragma as shown.
#include "diamond.h"

void diamond(data_t vecIn[N], data_t vecOut[N])
{
    data_t c1[N], c2[N], c3[N], c4[N];
#pragma HLS dataflow
    funcA(vecIn, c1, c2);
    funcB(c1, c3);
    funcC(c2, c4);
    funcD(c3, c4, vecOut);
}
In the above example, there are four functions: funcA, funcB, funcC, and funcD.
funcB and funcC do not have any data dependencies between them and therefore
can be executed in parallel. funcA reads from the non-local memory (vecIn) and
needs to be executed first. Similarly, funcD writes to the non-local memory
(vecOut) and therefore has to be executed last.
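The text above does not show diamond.h or the bodies of the four functions. The following is a sketch under assumed definitions: data_t as int, N as 100, and the arithmetic in each function are all hypothetical and chosen only to make the diamond-shaped data movement visible. The diamond function itself is repeated from above so that the sketch is self-contained. A standard C++ compiler ignores the HLS pragma, which makes the sequential semantics easy to check in software.

```cpp
#include <cassert>

// Assumed contents of diamond.h (hypothetical; not shown in the text).
typedef int data_t;
const int N = 100;

// Reads the non-local input and fans out to two channels.
void funcA(data_t in[N], data_t out1[N], data_t out2[N]) {
    for (int i = 0; i < N; i++) {
        data_t t = in[i] * 3;
        out1[i] = t;  // consumed by funcB
        out2[i] = t;  // consumed by funcC
    }
}

// funcB and funcC share no data, so they can run in parallel.
void funcB(data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; i++) out[i] = in[i] + 25;
}

void funcC(data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; i++) out[i] = in[i] * 2;
}

// Joins both branches and writes the non-local output.
void funcD(data_t in1[N], data_t in2[N], data_t out[N]) {
    for (int i = 0; i < N; i++) out[i] = in1[i] + in2[i];
}

void diamond(data_t vecIn[N], data_t vecOut[N]) {
    data_t c1[N], c2[N], c3[N], c4[N];
#pragma HLS dataflow
    funcA(vecIn, c1, c2);
    funcB(c1, c3);
    funcC(c2, c4);
    funcD(c3, c4, vecOut);
}

// Software check: with the assumed bodies, each output element is
// (3x + 25) + (3x * 2) = 9x + 25, whether or not the tasks overlap.
void check_diamond() {
    data_t in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;
    diamond(in, out);
    for (int i = 0; i < N; i++) assert(out[i] == 9 * in[i] + 25);
}
```

Each intermediate array (c1 through c4) has exactly one producer and one consumer, which is the property that lets the tool turn each of them into a channel between two processes.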
The following waveform shows the execution profile of this design without the
dataflow model. There are three calls to the function diamond from the test bench.
funcA, (funcB, funcC), and funcD are executed in sequential order. Each call to
diamond, therefore, takes 475 cycles in total as shown in the figure below.
In the following figure, the dataflow model is applied and the designer has selected FIFOs for the channels. All the functions are started immediately by the controller and then stall while waiting for input. As soon as input arrives, it is processed and sent out. Because of this overlap, each call to diamond now takes only 275 cycles in total as shown below. Refer to Combining the Three Paradigms for a more detailed discussion of the types of parallelism that can be achieved for this example.
This type of parallelism cannot be achieved without incurring an increase in
hardware utilization. When a particular region, such as a function body or a loop body,
is identified as a region to apply the dataflow model, Vitis HLS analyzes the function or loop body and creates individual
channels from C++ variables (such as scalars, arrays, or user-defined channels such as
hls::stream or hls::stream_of_blocks) that model the flow of data in the dataflow region.
These channels can be simple FIFOs for scalar variables, or ping-pong (PIPO) buffers for
non-scalar variables like arrays (or stream of blocks when you need a combination of
FIFO and PIPO behavior with explicit locking of the blocks).
Each of these channels can contain additional signals to indicate when the channel is full or empty. By having individual FIFOs and/or PIPO buffers, Vitis HLS frees each task to execute independently, and throughput is limited only by the availability of the input and output buffers. This allows for better overlapping of task execution than a normal pipelined implementation, but does so at the cost of additional FIFO registers or block RAM for the ping-pong buffers.
The dataflow model is not limited to a chain of processes but can be used on
any directed acyclic graph (DAG) structure, or cyclic structure when using
hls::stream channels. It can produce two different forms of overlapping:
within an iteration if processes are connected with FIFOs, and across different
iterations when connected with PIPOs and FIFOs. This potentially improves performance
over a statically pipelined solution. It replaces the strict, centrally-controlled
pipeline stall philosophy with a distributed handshaking architecture using FIFOs and/or
PIPOs. The replacement of the centralized control structure with a distributed one also
benefits the fanout of control signals, for example register enables, which is
distributed among the control structures of individual processes. Refer to the Task Level Parallelism/Control-Driven
examples on GitHub for more examples of these concepts.