Data-driven task-level parallelism uses a task-channel modeling style in which you statically instantiate and connect tasks and channels explicitly. Tasks in this modeling style have only stream-type inputs and outputs. The tasks are not controlled by any function call/return semantics; instead, they are always running, waiting for data on their input streams.
Data-driven TLP models consist of tasks that execute whenever there is data to be processed. C simulation in Vitis HLS used to be limited to showing only sequential semantics and behavior. With the data-driven model, simulation can show the concurrent nature of parallel tasks and their interactions through the FIFO channels.
Implementing data-driven TLP in the Vitis HLS tool uses simple classes for modeling tasks (hls::task) and channels (hls::stream/hls::stream_of_blocks). Note that while Vitis HLS supports hls::tasks for a top-level function, you cannot use hls::stream_of_blocks for interfaces in top-level functions. Consider the simple task-channel example shown below:
#include "test.h"
void splitter(hls::stream<int> &in, hls::stream<int> &odds_buf, hls::stream<int> &evens_buf) {
int data = in.read();
if (data % 2 == 0)
evens_buf.write(data);
else
odds_buf.write(data);
}
void odds(hls::stream<int> &in, hls::stream<int> &out) {
out.write(in.read() + 1);
}
void evens(hls::stream<int> &in, hls::stream<int> &out) {
out.write(in.read() + 2);
}
void odds_and_evens(hls::stream<int> &in, hls::stream<int> &out1, hls::stream<int> &out2) {
hls_thread_local hls::stream<int> s1; // channel connecting t1 and t2
hls_thread_local hls::stream<int> s2; // channel connecting t1 and t3
// t1 infinitely runs function splitter, with input in and outputs s1 and s2
hls_thread_local hls::task t1(splitter, in, s1, s2);
// t2 infinitely runs function odds, with input s1 and output out1
hls_thread_local hls::task t2(odds, s1, out1);
// t3 infinitely runs function evens, with input s2 and output out2
hls_thread_local hls::task t3(evens, s2, out2);
}
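The following test bench is a minimal sketch (an assumption for illustration, not part of the original example) of how this design could be driven during C simulation. It assumes that test.h declares odds_and_evens and pulls in the hls::task and hls::stream headers; the value counts are illustrative. The key point is that the test bench writes enough input data to produce every output value it later reads.
#include "test.h"

int main() {
    hls::stream<int> in, out1, out2;

    // 8 inputs: the 4 odd values end up on out1 (+1), the 4 even values on out2 (+2).
    for (int i = 0; i < 8; ++i)
        in.write(i);

    // A single call starts the always-running tasks; further calls would reuse
    // the same task and channel instances because they are hls_thread_local.
    odds_and_evens(in, out1, out2);

    int errors = 0;
    for (int i = 0; i < 4; ++i) {
        if (out1.read() != (2 * i + 1) + 1) ++errors; // odds path: n + 1
        if (out2.read() != (2 * i) + 2)     ++errors; // evens path: n + 2
    }
    return errors;
}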
The special hls::task C++ class is:
- A new object declaration in your source code that requires a special qualifier. The hls_thread_local qualifier is required to keep the object (and the underlying thread) alive across multiple calls of the instantiating function (odds_and_evens in the example). The hls_thread_local qualifier is only required to ensure that the C simulation of the data-driven TLP model exhibits the same behavior as the RTL simulation. In the RTL, these functions are already in always-running mode once started. To ensure the same behavior during C simulation, the hls_thread_local qualifier ensures that each task is started only once and keeps the same state even when called multiple times. Without the hls_thread_local qualifier, each new invocation of the function would result in a new state.
- Task objects implicitly manage a thread that runs a function infinitely, passing to it a set of arguments that must be either hls::stream or hls::stream_of_blocks. Tip: No other types of arguments are supported. In particular, even constant values cannot be passed as function arguments. If constants need to be passed to the task's body, define the function as a templated function and pass the constant as a template argument to this templated function, as shown in the sketch after this list.
- The supplied function (splitter/odds/evens in the example above) is called the task body, and it has an implicit infinite loop wrapped around it to ensure that the task keeps running and waiting on input.
- The supplied function can contain pipelined loops, but they need to be flushable pipelines (FLP) in order to prevent deadlock. The tool automatically selects the right pipeline style to use for a given pipelined loop or function.
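As a hedged sketch of the templated-function workaround mentioned in the tip above (the names adder, add_top, and INC are illustrative, not part of the original example, and the hls_task.h header is assumed to be available as in recent Vitis HLS releases), a constant can be baked into the task body as a template argument instead of a function argument:
#include "hls_task.h"

template <int INC>
void adder(hls::stream<int> &in, hls::stream<int> &out) {
    // Task body: the implicit infinite loop around it is added by hls::task.
    out.write(in.read() + INC);
}

void add_top(hls::stream<int> &in, hls::stream<int> &out) {
    // The constant 5 is fixed at compile time through the template argument.
    hls_thread_local hls::task t(adder<5>, in, out);
}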
An hls::task should not be treated as a function call; instead, an hls::task should be thought of as a persistent instance statically bound to channels. Because of this, it is your responsibility to ensure that multiple invocations of any function that contains hls::tasks are uniquified, otherwise these calls will use the same hls::tasks and channels.
Channels are modeled by the special templatized hls::stream (or hls::stream_of_blocks) C++ class. Such channels have the following attributes:
- In the data-driven TLP model, an hls::stream<type, depth> object behaves like a FIFO with a specified depth. Such streams have a default depth of 2, which can be overridden by the user.
- The streams are read from and written to sequentially. This implies that once a data item is read from an hls::stream<>, that same data item cannot be read again. Tip: Accesses to different streams are not ordered (for example, the order of a write to one stream and a read from a different stream can be changed by the scheduler).
- Streams can be defined either locally or globally. Streams defined in the global scope follow the same rules as any other global variables. Both forms are shown in the sketch after this list.
- The hls_thread_local qualifier is also required for streams (s1 and s2 in the example above) in order to keep the same streams alive across multiple calls of the instantiating function (odds_and_evens in the code example above).
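The sketch below (function and stream names are illustrative, not taken from the original example) shows a local channel with an explicit depth of 8 instead of the default 2, as well as a file-scope channel that follows the usual rules for global variables:
#include "hls_task.h"

// File-scope channel: follows the same rules as any other global variable and
// still needs hls_thread_local to survive repeated calls during C simulation.
hls_thread_local hls::stream<int, 16> g_mid;

void doubler(hls::stream<int> &in, hls::stream<int> &out)   { out.write(2 * in.read()); }
void plus_one(hls::stream<int> &in, hls::stream<int> &out)  { out.write(in.read() + 1); }
void minus_one(hls::stream<int> &in, hls::stream<int> &out) { out.write(in.read() - 1); }

void three_stage(hls::stream<int> &in, hls::stream<int> &out) {
    // Local channel with an explicit depth of 8 rather than the default 2.
    hls_thread_local hls::stream<int, 8> mid;

    hls_thread_local hls::task t1(doubler,   in,    mid);
    hls_thread_local hls::task t2(plus_one,  mid,   g_mid);
    hls_thread_local hls::task t3(minus_one, g_mid, out);
}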
The following diagram shows the graphical representation in Vitis HLS of the code example above. In this diagram, the green arrows are FIFO channels, while the blue arrows indicate the inputs and outputs of the instantiating function (odds_and_evens). Tasks are shown as blue rectangular boxes.
Figure: hls::task Example
Because a read of an empty stream is a blocking read, deadlocks can occur due to:
- The design itself, where the production and consumption rates of the processes are unbalanced:
  - During C simulation, deadlocks can occur only due to a cycle of processes, or a chain of processes starting from a top-level input, that are attempting to read from empty channels.
  - During C/RTL co-simulation and when running in hardware (HW), deadlocks can occur due to cycles of processes trying to write to full channels and/or read from empty channels.
- The test bench, which provides less data than is needed to produce all the outputs that the test bench expects when checking the computation results.
Because of this, a deadlock detector is automatically instantiated when the design contains an hls::task. The deadlock detector detects deadlocks and stops the C simulation. Further debugging is performed with a C debugger such as gdb by looking at where the simulated hls::tasks are all blocked trying to read from an empty channel. Note that this is easy to do using the Vitis HLS GUI, as shown in the handling_deadlock example for debugging deadlocks.
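As an illustration (this test bench is an assumption, not part of the original example), feeding the odds_and_evens design above only even values starves the odds task, so a blocking read of out1 can never complete; the deadlock detector stops the C simulation, and the blocked reads can then be inspected in gdb or the Vitis HLS GUI:
#include "test.h"

int main() {
    hls::stream<int> in, out1, out2;

    for (int i = 0; i < 4; ++i)
        in.write(2 * i);      // even values only: splitter never writes to s1

    odds_and_evens(in, out1, out2);

    int v = out2.read();      // succeeds: the evens path produced data
    int w = out1.read();      // blocks forever: the odds path never wrote anything
    return v + w;
}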
In summary, the hls::task model is recommended if your design requires a completely data-driven, pure streaming type of behavior, with no function call/return control. This type of model is also useful for modeling feedback and dynamic multi-rate designs. Feedback occurs in a design when there is a cyclical dependency between tasks. Dynamic multi-rate models, where the producer writes data or the consumer reads data at a rate that is data dependent, can only be handled by data-driven TLP. The simple_data_driven design on GitHub is an example of this; a minimal illustrative sketch follows below.
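The sketch below (the names filter_pos, scale, and filter_top are illustrative, not taken from the GitHub design) shows a dynamic multi-rate pair of tasks: the first task forwards only non-negative samples, so the number of values it writes per input is data dependent, and the downstream task simply runs whenever data arrives on its input channel:
#include "hls_task.h"

// Data-dependent production rate: each input yields either 0 or 1 outputs.
void filter_pos(hls::stream<int> &in, hls::stream<int> &out) {
    int v = in.read();
    if (v >= 0)
        out.write(v);
}

// Downstream consumer: processes whatever the filter lets through.
void scale(hls::stream<int> &in, hls::stream<int> &out) {
    out.write(10 * in.read());
}

void filter_top(hls::stream<int> &in, hls::stream<int> &out) {
    hls_thread_local hls::stream<int> mid;
    hls_thread_local hls::task t1(filter_pos, in, mid);
    hls_thread_local hls::task t2(scale, mid, out);
}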