The DATAFLOW optimization improves the flow of data between tasks (functions and loops), ideally with pipelined functions and loops for maximum performance. It does not require these tasks to be chained one after the other; however, there are some limitations in how the data is transferred.
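For reference, the following is a minimal sketch of a canonical dataflow region: a chain of tasks connected by an intermediate buffer, with the DATAFLOW pragma applied at the top level. The task names producer and consumer are hypothetical, and N is assumed to be defined elsewhere, as in the examples that follow.
// Hypothetical tasks; N is assumed to be defined elsewhere.
void producer(int in[N], int mid[N]);   // reads the function input
void consumer(int mid[N], int out[N]);  // writes the function output

void top(int in[N], int out[N]) {
#pragma HLS dataflow
  int mid[N];          // channel between the two tasks (a PIPO buffer by default)
  producer(in, mid);
  consumer(mid, out);
}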
The following behaviors can prevent or limit the overlapping that Vitis HLS can perform with DATAFLOW optimization:
- Reading from function inputs or writing to function outputs in the middle of the dataflow region
- Single-producer-consumer violations
- Bypassing tasks and channel sizing
- Feedback between tasks
- Conditional execution of tasks
- Loops with multiple exit conditions
Reading from Inputs/Writing to Outputs
Reading the function inputs should be done at the start of the dataflow region, and writing the function outputs should be done at the end of the dataflow region. Reading from or writing to the ports of the function in the middle of the region can cause the processes to execute in sequence rather than in an overlapped fashion, adversely impacting performance.
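For illustration only, the following hypothetical sketch shows the pattern to avoid: the function input in is read again by the second task, in the middle of the dataflow region, rather than only at the start. The names top, in, out, mid, Read, and Write are illustrative, and N is assumed to be defined elsewhere.
void top(int in[N], int out[N]) {
#pragma HLS dataflow
  int mid[N];
  Read: for (int i = 0; i < N; i++)
    mid[i] = in[i] * 2;          // input port read at the start of the region: OK
  Write: for (int i = 0; i < N; i++)
    out[i] = mid[i] + in[i];     // input port read again in the middle of the region: limits overlap
}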
Single-producer-consumer Violations
For Vitis HLS to perform the DATAFLOW optimization, all elements passed between tasks must follow a single-producer-consumer model. Each variable must be driven from a single task and only be consumed by a single task. In the following code example, temp1 fans out and is consumed by both Loop2 and Loop3. This violates the single-producer-consumer model.
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
  int temp1[N];
  Loop1: for (int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
  }
  Loop2: for (int j = 0; j < N; j++) {
    data_out1[j] = temp1[j] * 123;
  }
  Loop3: for (int k = 0; k < N; k++) {
    data_out2[k] = temp1[k] * 456;
  }
}
A modified version of this code uses the function Split to create a single-producer-consumer design. The following code example shows how the data flows with the function Split. The data now flows between all four tasks, and Vitis HLS can perform the DATAFLOW optimization.
void Split(int in[N], int out1[N], int out2[N]) {
  // Duplicate the data into two buffers
  L1: for (int i = 0; i < N; i++) {
    out1[i] = in[i];
    out2[i] = in[i];
  }
}
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
  int temp1[N], temp2[N], temp3[N];
  Loop1: for (int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
  }
  Split(temp1, temp2, temp3);
  Loop2: for (int j = 0; j < N; j++) {
    data_out1[j] = temp2[j] * 123;
  }
  Loop3: for (int k = 0; k < N; k++) {
    data_out2[k] = temp3[k] * 456;
  }
}
Bypassing Tasks and Channel Sizing
In addition, data should generally flow from one task to another. If you bypass tasks, this can reduce the performance of the DATAFLOW optimization. In the following example, Loop1 generates the values for temp1 and temp2. However, the next task, Loop2, only uses the value of temp1. The value of temp2 is not consumed until after Loop2. Therefore, temp2 bypasses the next task in the sequence, which can limit the performance of the DATAFLOW optimization.
void foo(int data_in[N], int scale, int data_out[N]) {
  int temp1[N], temp2[N], temp3[N];
  Loop1: for (int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }
  Loop2: for (int j = 0; j < N; j++) {
    temp3[j] = temp1[j] + 123;
  }
  Loop3: for (int k = 0; k < N; k++) {
    data_out[k] = temp2[k] + temp3[k];
  }
}
In this case, you should increase the depth of the PIPO buffer used to store temp2 to 3, instead of the default depth of 2. This lets the buffer store the value intended for Loop3 while Loop2 is being executed. Similarly, a PIPO that bypasses two processes should have a depth of 4. Set the depth of the buffer with the STREAM pragma or directive:
#pragma HLS STREAM type=pipo variable=temp2 depth=3
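As a minimal sketch, based on the bypassing example above, the pragma is typically placed inside the function body next to the declaration of temp2; the elided loops are the ones shown above.
void foo(int data_in[N], int scale, int data_out[N]) {
  int temp1[N], temp2[N], temp3[N];
#pragma HLS STREAM type=pipo variable=temp2 depth=3
  // ... Loop1, Loop2, and Loop3 as shown above ...
}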
Feedback between Tasks
Feedback occurs when the output from a task is consumed by a previous task in the DATAFLOW region. Feedback between tasks is not recommended in a DATAFLOW region. When Vitis HLS detects feedback, it issues a warning, depending on the situation, and might not perform the DATAFLOW optimization.
However, DATAFLOW can support feedback when used with hls::stream. The following example demonstrates this exception.
#include "ap_axi_sdata.h"
#include "hls_stream.h"
void firstProc(hls::stream<int> &forwardOUT, hls::stream<int> &backwardIN) {
static bool first = true;
int fromSecond;
//Initialize stream
if (first)
fromSecond = 10; // Initial stream value
else
//Read from stream
fromSecond = backwardIN.read(); //Feedback value
first = false;
//Write to stream
forwardOUT.write(fromSecond*2);
}
void secondProc(hls::stream<int> &forwardIN, hls::stream<int> &backwardOUT) {
backwardOUT.write(forwardIN.read() + 1);
}
void top(...) {
#pragma HLS dataflow
hls::stream<int> forward, backward;
firstProc(forward, backward);
secondProc(forward, backward);
}
In this simple design, when firstProc is executed, it uses 10 as an initial value for the input. Because hls::stream does not support an initial value, this technique can be used to provide one without violating the single-producer-consumer rule. In subsequent iterations, firstProc reads from the hls::stream through the backwardIN interface. firstProc processes the value and sends it to secondProc via a stream that goes forward in terms of the original C++ function execution order. secondProc reads the value on forwardIN, adds 1 to it, and sends it back to firstProc via the feedback stream that goes backward in the execution order. From the second execution onward, firstProc uses the value read from the stream to do its computation, and the two processes can keep going forever, with both forward and feedback communication, using an initial value for the first execution.
Conditional Execution of Tasks
The DATAFLOW optimization does not optimize tasks that are conditionally executed. The following example highlights this limitation. In this example, the conditional execution of Loop1 and Loop2 prevents Vitis HLS from optimizing the data flow between these loops, because the data does not flow from one loop into the next.
void foo(int data_in[N], int data_out[N], int sel) {
  int temp1[N], temp2[N];
  if (sel) {
    Loop1: for (int i = 0; i < N; i++) {
      temp1[i] = data_in[i] * 123;
      temp2[i] = data_in[i];
    }
  } else {
    Loop2: for (int j = 0; j < N; j++) {
      temp1[j] = data_in[j] * 321;
      temp2[j] = data_in[j];
    }
  }
  Loop3: for (int k = 0; k < N; k++) {
    data_out[k] = temp1[k] * temp2[k];
  }
}
To ensure each loop is executed in all cases, you must transform the code as shown in the following example. In this example, the conditional statement is moved into the first loop. Both loops are always executed, and data always flows from one loop to the next.
void foo(int data_in[N], int data_out[N], int sel) {
  int temp1[N], temp2[N];
  Loop1: for (int i = 0; i < N; i++) {
    if (sel) {
      temp1[i] = data_in[i] * 123;
    } else {
      temp1[i] = data_in[i] * 321;
    }
  }
  Loop2: for (int j = 0; j < N; j++) {
    temp2[j] = data_in[j];
  }
  Loop3: for (int k = 0; k < N; k++) {
    data_out[k] = temp1[k] * temp2[k];
  }
}
Loops with Multiple Exit Conditions
Loops with multiple exit points cannot be used in a DATAFLOW region. In the following example, Loop2 has three exit conditions:
- An exit defined by the value of N: the loop will exit when k >= N.
- An exit defined by the break statement.
- An exit defined by the continue statement.
#include "ap_int.h"
#define N 16

typedef ap_int<8> din_t;
typedef ap_int<15> dout_t;
typedef ap_uint<8> dsc_t;
typedef ap_uint<1> dsel_t;

void multi_exit(din_t data_in[N], dsc_t scale, dsel_t select, dout_t data_out[N]) {
  dout_t temp1[N], temp2[N];
  int i, k;
  Loop1: for (i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }
  Loop2: for (k = 0; k < N; k++) {
    switch (select) {
      case 0: data_out[k] = temp1[k] + temp2[k];
      case 1: continue;
      default: break;
    }
  }
}
Because a loop’s exit condition is always defined by the loop bounds, the use of break or continue statements will prohibit the loop from being used in a DATAFLOW region.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the DATAFLOW optimization to the loop or the sub-function, or inline the sub-function.
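For example, because the pragma does not propagate into sub-functions, a sub-function that contains its own chain of tasks needs its own DATAFLOW pragma. The following is a minimal sketch under that assumption; the task names stageA, stageB, and stageC and the function sub are hypothetical, and N is assumed to be defined elsewhere.
// Hypothetical tasks; N is assumed to be defined elsewhere.
void stageA(int in[N], int out[N]);
void stageB(int in[N], int out[N]);
void stageC(int in[N], int out[N]);

void sub(int in[N], int out[N]) {
#pragma HLS dataflow   // applied again here; the pragma in top() does not propagate down
  int mid[N];
  stageA(in, mid);
  stageB(mid, out);
}

void top(int in[N], int out[N]) {
#pragma HLS dataflow
  int mid[N];
  sub(in, mid);        // sub-function containing its own dataflow region
  stageC(mid, out);
}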
std::complex objects can be used inside the DATAFLOW region. However, they should be used with an __attribute__((no_ctor)) as shown in the following example:
void proc_1(std::complex<float> (&buffer)[50], const std::complex<float> *in);
void proc_2(hls::stream<std::complex<float>> &fifo, const std::complex<float> (&buffer)[50], std::complex<float> &acc);
void proc_3(std::complex<float> *out, hls::stream<std::complex<float>> &fifo, const std::complex<float> acc);
void top(std::complex<float> *out, const std::complex<float> *in) {
#pragma HLS DATAFLOW
  std::complex<float> acc __attribute__((no_ctor));        // Here
  std::complex<float> buffer[50] __attribute__((no_ctor)); // Here
  hls::stream<std::complex<float>, 5> fifo;                // Not here
  proc_1(buffer, in);
  proc_2(fifo, buffer, acc);
  proc_3(out, fifo, acc);
}