Canonical Dataflow

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2026-01-22
Version: 2025.2 English

When Vitis HLS encounters #pragma HLS dataflow, it converts the function or loop body in which the pragma occurs (called a "dataflow region" in the following) into a set of processes.

This set of processes can be identified in one of two ways:

  • Either by the designer, who calls one function or instantiates one hls::task per process, e.g.:
    void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
    #pragma HLS dataflow
      int C0, C1[N], C2; 
      func1(Input0, Input1, C0, C1); // first process
      func2(C0, C1, C2); // second process
      func3(C2, Output0, Output1); // third process
    }
  • Or by the tool, if the dataflow region contains code other than function calls, e.g.:
    void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
    #pragma HLS dataflow
      int C0, C1[N], C2;
      C0 = Input0 * 3; // first process
      for (int i = 0; i < N; i++) { // first process or second process?
        C1[i] = Input1[0] + 2;
      }
      func2(C0, C1, C2); // second process or third process?
      func3(C2, Output0, Output1); // third process or fourth process?
    }

The first form produces a dataflow network that is more predictable, in terms of both performance and structure, while the second is much less predictable, because the tool can partition the code into processes in more than one way.

This section discusses how to achieve good predictability by following a set of coding style guidelines (also known as "canonical dataflow").

These guidelines will be discussed at three levels, from basic to advanced.

The Vitis HLS front end generates a warning whenever these coding style rules are violated.

In this case, Vitis HLS does its best to implement the resulting dataflow, or errors out if the violations are fatal (e.g., when an array has multiple writer processes or multiple reader processes).

However, always check the GUI dataflow viewer and the cosimulation timeline trace to ensure that the dataflow happens as expected and that the achieved performance meets expectations.

In particular, any non-dataflow hierarchy level (i.e., a sequential FSM) within the dataflow hierarchy typically indicates a loss of performance, since it does not allow overlapped execution of the dataflow regions that it contains.

Basic Level

At the most basic level, the dataflow pragma can be applied in two different contexts:

  • Inside a function (also known as function dataflow region)

  • Inside a special kind of for loop (also known as loop dataflow region) such that:

    • The loop is the only statement inside the body of a function (in particular, variable declarations outside the loop are forbidden).
    • The loop variable is of type integer

    • The initial value is declared in the loop header and set to any non-negative integer constant.
    • The exit condition is an "<" comparison with a non-negative numerical constant or a scalar argument of the function that encloses the loop.
    • The loop variable is incremented by any positive integer constant.
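To make these rules concrete, the following is a minimal sketch of a loop dataflow region that satisfies all of them. The process functions produce and consume, and the transformations they apply, are hypothetical and only for illustration:

```cpp
// Hypothetical processes; each has return type void and communicates
// through the local scalar c.
static void produce(const int in[], int i, int &c) { c = in[i] * 2; }
static void consume(int c, int out[], int i)       { out[i] = c + 1; }

// Loop dataflow region: the loop is the only statement in the function
// body, the loop variable is an int initialized to 0, the exit condition
// is an '<' comparison against the scalar argument n, and the increment
// is a positive integer constant.
void top(const int in[], int out[], int n) {
  for (int i = 0; i < n; i += 1) {
#pragma HLS dataflow
    int c;                 // local channel, no initialization
    produce(in, i, c);     // first process: writes c
    consume(c, out, i);    // second process: reads c
  }
}
```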

The recommended coding style for predictability is that the body of that function or for loop must contain only a sequence of

  • sub-function calls (using ap_ctrl_chain) and/or
  • hls::task declarations (using ap_ctrl_none).

There cannot be any control inside a dataflow region (if-then-elses and loops would be automatically converted into processes, which results in the unpredictability of dataflow structure discussed above).
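For instance, an if-then-else around process calls is itself turned into an extra process; moving the condition inside a single process keeps the region canonical. A minimal sketch, where funcA, funcB, funcC and the selection logic are hypothetical:

```cpp
static void funcA(int in, int &c)  { c = in + 1; }
static void funcB(int in, int &c)  { c = in - 1; }
static void funcC(int c, int &out) { out = c * 2; }

// Discouraged: control flow at the dataflow level. The if-then-else is
// converted into an additional process, making the structure of the
// resulting dataflow network harder to predict.
void dataflow_bad(int sel, int in, int &out) {
#pragma HLS dataflow
  int c;
  if (sel) funcA(in, c); else funcB(in, c);
  funcC(c, out);
}

// Preferred: the condition is evaluated inside one process, so the
// dataflow region is a plain sequence of function calls.
static void select_proc(int sel, int in, int &c) {
  if (sel) funcA(in, c); else funcB(in, c);
}
void dataflow_good(int sel, int in, int &out) {
#pragma HLS dataflow
  int c;
  select_proc(sel, in, c);
  funcC(c, out);
}
```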

These called sub-functions and hls::tasks:

  • must have a return type void
  • can be:
    • Sequential functions or pipelined functions, called processes, or
    • Function dataflow regions, or
    • Loop dataflow regions.
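The last two bullets mean that dataflow regions can be nested: a process of one dataflow region may itself be a dataflow region. A minimal sketch, with hypothetical stage functions:

```cpp
static void stageA(int in, int &c)  { c = in + 1; }
static void stageB(int c, int &out) { out = c * 2; }
static void stageC(int c, int &out) { out = c - 3; }

// A function dataflow region...
void inner(int in, int &out) {
#pragma HLS dataflow
  int c;
  stageA(in, c);
  stageB(c, out);
}

// ...called as one process of an enclosing function dataflow region.
void outer(int in, int &out) {
#pragma HLS dataflow
  int c;
  inner(in, c);
  stageC(c, out);
}
```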

Other syntactic restrictions:

  • Variable initialization (including initialization performed automatically by constructors) is not allowed:

    • For standard data types that have default constructors that cannot be redefined (for example: std::complex), one can avoid initialization by using the no_ctor attribute
      • For example:
        std::complex<float> arr[SIZE] __attribute__((no_ctor));
  • Passing expressions by value to processes is not allowed.

    • In the example above, the first process must be declared as follows (note the passing of Input1 by reference):

      void func1(int Input0, int Input1[], int &Output0, int Output1[])
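As an illustration of this restriction (a hedged sketch; adjust, scale and dataflow_expr are hypothetical names, not from this guide): instead of passing an expression such as in + 1 to a process, compute it inside a process of its own:

```cpp
static void scale(int v, int &out) { out = v * 3; }

// Not allowed inside a dataflow region: the argument is an expression,
// which would require logic outside any process.
//   scale(in + 1, out);   // violates the rule

// Allowed: fold the expression into its own process.
static void adjust(int in, int &c) { c = in + 1; }

void dataflow_expr(int in, int &out) {
#pragma HLS dataflow
  int c;
  adjust(in, c);   // first process: computes in + 1
  scale(c, out);   // second process: reads c
}
```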

Example of recommended style for a dataflow function:

void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
#pragma HLS dataflow
  int C1[N], C2;  // no initialization
  UserDataType C0 __attribute__((no_ctor)); // no_ctor must be used if the default constructor is not empty
  func1(Input0, Input1, C0, C1); // read Input0, read Input1, write C0, write C1
  func2(C0, C1, C2); // read C0, read C1, write C2
  func3(C2, Output0, Output1); // read C2, write Output0, write Output1
}

Example of recommended style for dataflow in loop:

void dataflow(int Input0, int Input1[], int &Output0, int Output1[]) {
  for (int i = 0; i < N; i++) {
    #pragma HLS dataflow
    int C1[N], C2;  // no initialization
    UserDataType C0 __attribute__((no_ctor)); // no_ctor must be used if the default constructor is not empty
    func1(Input0, Input1, C0, C1, i); // read Input0, read Input1, write C0, write C1
    func2(C0, C1, C2, i);             // read C0, read C1, write C2
    func3(C2, Output0, Output1, i);   // read C2, write Output0, write Output1
  }
}
Note: The function in which the loop occurs does not require the dataflow pragma.

Further semantic restrictions about the body of dataflow functions or regions:

  • Local variables must be non-static scalars or arrays (static variables are allowed only inside called processes).
    • Instances of hls::tasks and hls::threads in a canonical region must be declared as hls_thread_local (e.g. hls_thread_local hls::task t1(proc, arg1, arg2, arg3); ).
      • Note that hls_thread_local is like static, but it is safer since it does not imply a shared single variable instance among multiple hls::tasks with the same function body.
  • The sequence of function calls and/or hls::tasks must transfer data using local variables, under the following conditions:
    • Arrays:

      • Can have only one writer process and one reader process.
      • The writer must be lexically before the reader (i.e. no loop-carried dependences)
    • hls::streams and hls::stream_of_blocks:
      • Can have only one reading process and one writing process.
      • Can be used also to transfer data backwards (to processes that are lexically earlier)
      • Care must be taken to ensure that processes that read data from these feedback streams do not attempt to read from them until some data has been produced by later processes.
        • The most common way to satisfy this requirement is to use an hls_thread_local variable inside these processes, using it to skip the reading on the first execution
        • Using non-blocking reads for the same purpose is strongly discouraged, because it inherently leads to non-deterministic implementation and cosimulation mismatches.
    • Scalars:
      • can have multiple writer and reader processes
      • are automatically converted into FIFO channels
      • cannot have loop-carried dependences in case of dataflow-in-loop
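The first-execution skip for a feedback stream can be sketched as follows in plain C++ (std::queue stands in for hls::stream and a static flag for the hls_thread_local variable; this models only the behavior of the pattern, not the HLS implementation):

```cpp
#include <queue>

static std::queue<int> forward_chan;  // models a forward hls::stream
static std::queue<int> feedback_chan; // models a backward hls::stream

// First process: adds the previous result, fed back from 'sink', to the
// new input. On the first execution no feedback data exists yet, so a
// persistent flag (hls_thread_local in real HLS code) skips the read.
void source(int in) {
  static bool first = true;
  int feedback = 0;
  if (first) {
    first = false;              // skip the feedback read once
  } else {
    feedback = feedback_chan.front();
    feedback_chan.pop();
  }
  forward_chan.push(in + feedback);
}

// Second process: consumes the forward stream and sends its result
// backwards to 'source' through the feedback stream.
void sink(int &out) {
  out = forward_chan.front();
  forward_chan.pop();
  feedback_chan.push(out);
}
```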

Advanced Level

In addition to the basic level discussed above, the recommended predictable dataflow coding style also supports:

  • Fully unrolled loops within a dataflow region. The loop bodies can contain only function calls and hls::task instantiations.
  • Fully partitioned arrays, where each partition is passed as an independent variable (array or scalar) to one process.

For example, this code creates a pipeline of N identical processes, all performing the same functionality and cascaded via a chain of hls::streams:

void dut(int in[M], int out[M]) {
  #pragma HLS dataflow
  hls_thread_local hls::stream<int> chan[N+1]; // arrays of hls::streams are fully partitioned
  read_in(in, chan[0]);
  hls_thread_local hls::task t[N]; // array of worker processes
  for (int i = 0; i < N; i++) {
    #pragma HLS unroll
    t[i](worker, chan[i], chan[i+1]);
  }
  write_out(chan[N], out);
}
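As a sanity check of the structure above, here is a plain C++ model of the same chain (std::queue replaces hls::stream, the hls::task instances become plain function calls, and worker is a hypothetical +1 stage; concurrency is not modeled):

```cpp
#include <queue>

const int M = 4;  // number of data items
const int N = 3;  // number of cascaded workers

static void read_in(const int in[M], std::queue<int> &out) {
  for (int j = 0; j < M; j++) out.push(in[j]);
}
static void worker(std::queue<int> &in, std::queue<int> &out) {
  for (int j = 0; j < M; j++) { out.push(in.front() + 1); in.pop(); }
}
static void write_out(std::queue<int> &in, int out[M]) {
  for (int j = 0; j < M; j++) { out[j] = in.front(); in.pop(); }
}

void dut_model(const int in[M], int out[M]) {
  std::queue<int> chan[N + 1];   // one channel per link in the chain
  read_in(in, chan[0]);
  for (int i = 0; i < N; i++)    // the unrolled loop of workers
    worker(chan[i], chan[i + 1]);
  write_out(chan[N], out);
}
```

Each item traverses N workers, so each output equals the corresponding input plus N.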

This is another example that instead uses a chain of PIPOs (or of streamed arrays, if one adds #pragma HLS stream variable=chan):

void dut(int in[M], int out[M]) {
  #pragma HLS dataflow
  int chan[N+1][M]; // partitioned into N+1 arrays of M elements
  #pragma HLS array_partition complete dim=1 variable=chan
  read_in(in, chan[0]);
  for (int i = 0; i < N; i++) {
    #pragma HLS unroll
    worker(chan[i], chan[i+1]);
  }
  write_out(chan[N], out);
}