When Vitis HLS finds the #pragma HLS dataflow, it converts the function or loop body where the pragma occurs (called "dataflow region" in the following) into a set of processes.
This set of processes can be identified:
- Either by the designer, who calls one function or instantiates one hls::task per process, e.g.:

void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
#pragma HLS dataflow
  int C0, C1[N], C2;
  func1(Input0, Input1, C0, C1);   // first process
  func2(C0, C1, C2);               // second process
  func3(C2, Output0, Output1);     // third process
}

- Or by the tool, if the dataflow region contains code other than function calls, e.g.:

void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
#pragma HLS dataflow
  int C0, C1[N], C2;
  C0 = Input0 * 3;                 // first process
  for (int i = 0; i < N; i++) {    // first process or second process?
    C1[i] = Input1[0] + 2;
  }
  func2(C0, C1, C2);               // second process or third process?
  func3(C2, Output0, Output1);     // third process or fourth process?
}
The first form produces a dataflow network that is more predictable, in terms of both performance and structure, while the second one is much less predictable, because the tool may have more than one way to partition the code into processes.
This section discusses how to achieve good predictability by following a set of coding style guidelines (also known as "canonical dataflow").
These guidelines will be discussed at three levels, from basic to advanced.
The Vitis HLS front-end generates warnings whenever the coding style rules are violated.
In this case, Vitis HLS does its best to implement the resulting dataflow (or errors out if the violations are fatal, e.g. when an array has multiple writer processes or multiple reader processes).
But one must always check the GUI dataflow viewer and the cosimulation timeline trace to ensure that dataflow execution and the achieved performance match expectations.
In particular, any non-dataflow hierarchy level (i.e. sequential FSM) within the dataflow hierarchy typically indicates a loss of performance, since it does not allow overlapped execution of the dataflow regions that it contains.
Basic Level
At the most basic level, the dataflow pragma can be applied in two different contexts:
- Inside a function (also known as a function dataflow region)
- Inside a special kind of for loop (also known as a loop dataflow region) such that:
  - The loop is the only statement inside the body of the function
  - Variable declarations outside the loop are forbidden
  - The loop variable is of integer type
  - The initial value is declared in the loop header and set to a non-negative integer constant
  - The exit condition is a "<" comparison with a non-negative numerical constant or a scalar argument of the function that encloses the loop
  - The loop variable is incremented by a positive integer constant
The recommended coding style for predictability is that the body of that function or for loop must contain only a sequence of
- sub-function calls (using ap_ctrl_chain) and/or
- hls::task declarations (using ap_ctrl_none).
There cannot be any control flow inside a dataflow region (if-then-else statements and loops would be automatically converted into processes, which results in the unpredictable dataflow structure discussed above).
These called sub-functions and hls::tasks:
- must have a return type void
- can be:
- Sequential functions or pipelined functions, called processes, or
- Function dataflow regions, or
- Loop dataflow regions.
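For instance, a loop that would otherwise sit directly in the dataflow region can be wrapped into its own process function, keeping the region a pure sequence of calls. A minimal sketch with assumed names (scale, mid, and the size N are illustrative, not from the text above):

```cpp
const int N = 4; // array size (assumed for illustration)

// Wrapping the loop into its own function turns it into one well-defined
// process, instead of leaving control flow in the dataflow region.
void scale(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] * 2;
}

void dataflow(const int in[N], int out[N]) {
#pragma HLS dataflow
    int mid[N];      // no initialization; single writer, single reader
    scale(in, mid);  // first process (writes mid)
    scale(mid, out); // second process (reads mid)
}
```

Each call becomes one process, and mid becomes the channel between them.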
Other syntactic restrictions:
- Variable initialization (including initialization performed automatically by constructors) is not allowed:
  - For standard data types whose default constructors cannot be redefined (for example std::complex), one can avoid initialization by using the no_ctor attribute. For example:
    std::complex<float> arr[SIZE] __attribute__((no_ctor));
- Passing expressions by value to processes is not allowed.
  - In the example above, the first process must be declared as follows (note the passing of Input1 by reference):
    void func1(int Input0, int Input1[], int &Output0, int Output1[])
Example of recommended style for a dataflow function:

void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
#pragma HLS dataflow
  int C1[N], C2; // no initialization
  UserDataType C0 __attribute__((no_ctor)); // no_ctor must be used if the default constructor is not empty
  func1(Input0, Input1, C0, C1); // read Input0, read Input1, write C0, write C1
  func2(C0, C1, C2); // read C0, read C1, write C2
  func3(C2, Output0, Output1); // read C2, write Output0, write Output1
}
Example of recommended style for dataflow in loop:
void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
for (int i = 0; i < N; i++) {
#pragma HLS dataflow
int C1[N], C2; // no initialization
UserDataType C0 __attribute__((no_ctor)); // no_ctor must be used if the default constructor is not empty
func1(Input0, Input1, C0, C1, i); // read Input0, read Input1, write C0, write C1
func2(C0, C1, C2, i); // read C0, read C1, write C2
func3(C2, Output0, Output1, i); // read C2, write Output0, write Output1
}
}
Further semantic restrictions about the body of dataflow functions or regions:
- Local variables must be non-static scalars or arrays (static variables are allowed only inside called processes).
- Instances of hls::tasks and hls::threads in a canonical region must be declared as hls_thread_local (e.g. hls_thread_local hls::task t1(proc, arg1, arg2, arg3);).
  - Note that hls_thread_local is like static, but it is safer, since it does not imply a single variable instance shared among multiple hls::tasks with the same function body.
- The sequence of function calls and/or hls::tasks must transfer data using local variables, under the following conditions:
  - Arrays:
    - Can have only one writer process and one reader process.
    - The writer must be lexically before the reader (i.e. no loop-carried dependences).
  - hls::streams and hls::stream_of_blocks:
    - Can have only one reading process and one writing process.
    - Can also be used to transfer data backwards (to processes that are lexically earlier).
    - Care must be taken to ensure that processes reading from these feedback streams do not attempt to read from them until some data has been produced by later processes.
      - The most common way to satisfy this requirement is to use an hls_thread_local variable inside these processes, using it to skip the read on the first execution.
      - Using non-blocking reads for the same purpose is strongly discouraged, because it inherently leads to non-deterministic implementations and cosimulation mismatches.
  - Scalars:
    - Can have multiple writer and reader processes.
    - Are automatically converted into FIFO channels.
    - Cannot have loop-carried dependences in the case of dataflow-in-loop.
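The hls_thread_local first-execution skip described above can be sketched as follows. This is an illustrative model rather than Vitis HLS code: a minimal Stream class stands in for hls::stream and a static stand-in replaces the hls_thread_local keyword so the sketch compiles standalone, and the process and variable names are assumed.

```cpp
#include <queue>

// Minimal stand-ins so this sketch compiles standalone; in Vitis HLS,
// hls::stream<int> and the hls_thread_local keyword would be used instead.
template <typename T> struct Stream {
    std::queue<T> q;
    void write(const T &v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
};
#define hls_thread_local static // stand-in; NOT equivalent in real HLS code

// Hypothetical process fed by a feedback stream that is written by a
// lexically later process: the persistent flag skips the feedback read
// on the first execution, before any feedback data exists.
void accumulate(Stream<int> &in, Stream<int> &loopback, Stream<int> &out) {
    hls_thread_local bool first = true;
    int fb = 0;
    if (!first)
        fb = loopback.read(); // safe: produced during a previous execution
    first = false;
    out.write(in.read() + fb);
}
```

Because the flag persists across executions of the process, only the very first execution bypasses the feedback read; all later executions block on it as usual, keeping the implementation deterministic.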
Advanced Level
In addition to the basic level discussed above, the recommended predictable dataflow coding style also supports:
- Fully unrolled loops within a dataflow region. The loop bodies can contain only function calls and hls::task instantiations.
- Fully partitioned arrays, where each partition is passed as an independent variable (array or scalar) to one process.
For example, this code creates a pipeline of N identical processes, all performing the same functionality, cascaded via a chain of hls::streams:
void dut(int in[M], int out[M]) {
#pragma HLS dataflow
hls_thread_local hls::stream<int> chan[N+1]; // arrays of hls::streams are fully partitioned
read_in(in, chan[0]);
hls_thread_local hls::task t[N]; // array of worker processes
for (int i=0; i<N; i++) {
#pragma HLS unroll
t[i](worker, chan[i], chan[i+1]);
}
write_out(chan[N], out);
}
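A possible body for the worker process above might look like the following sketch. The per-transaction element count M and the +1 computation are assumptions for illustration, and a minimal Stream stand-in replaces hls::stream so the sketch compiles standalone.

```cpp
#include <queue>

// Minimal stand-in for hls::stream so this sketch compiles standalone.
template <typename T> struct Stream {
    std::queue<T> q;
    void write(const T &v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
};

const int M = 4; // elements per transaction (assumed)

// Hypothetical body for the 'worker' process: each pipeline stage reads
// M values from its input stream and forwards them, incremented by one,
// to the next stage.
void worker(Stream<int> &in, Stream<int> &out) {
    for (int j = 0; j < M; j++)
        out.write(in.read() + 1);
}
```

Chaining N such stages through chan[0]..chan[N] would then add N to every element flowing through the pipeline.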
This is another example that instead uses a chain of PIPOs (or streamed arrays if one adds #pragma HLS stream variable=chan):
void dut(int in[M], int out[M]) {
#pragma HLS dataflow
int chan[N+1][M]; // partitioned into N+1 arrays of M elements
#pragma HLS array_partition complete dim=1 variable=chan
read_in(in, chan[0]);
for (int i=0; i<N; i++) {
#pragma HLS unroll
worker(chan[i], chan[i+1]);
}
write_out(chan[N], out);
}
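Correspondingly, a possible body for the worker process in this PIPO version might be the following sketch (M and the per-element computation are assumptions for illustration):

```cpp
const int M = 4; // elements per channel buffer (assumed)

// Hypothetical body for the 'worker' process in the PIPO chain: each
// stage consumes one full M-element buffer and produces the next one,
// here simply incrementing every element as an illustration.
void worker(const int in[M], int out[M]) {
    for (int j = 0; j < M; j++)
        out[j] = in[j] + 1;
}
```

Unlike the stream version, each stage here sees a complete buffer at once, so it may access the elements in any order; the ping-pong buffering still allows the stages to overlap across transactions.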