Unrolling and partitioning for parameterized task-level parallelism
In addition to the basic style discussed above, the recommended predictable dataflow coding style also supports a mechanism to parameterize task-level parallelism, via:
- Unrolled dataflow loops within a dataflow region. A dataflow pragma on the loop body is required so that the correctness of the region is checked before unrolling; after unrolling, the code must satisfy the canonicity conditions.
- Partitioned arrays, where each partition is passed as an independent variable (array or scalar) to one process.
For example, this code creates a pipeline of N identical processes all performing the same functionality and cascaded via a chain of hls::streams, where N is a compile-time constant:
void dut(int in[M], int out[M]) {
#pragma HLS dataflow
    hls_thread_local hls::stream<int> chan[N+1]; // arrays of hls::streams are fully partitioned
    read_in(in, chan[0]);
    hls_thread_local hls::task t[N]; // array of worker processes
    for (int i=0; i<N; i++) {
#pragma HLS unroll
#pragma HLS dataflow
        t[i](worker, chan[i], chan[i+1]);
    }
    write_out(chan[N], out);
}
The following example instead uses a chain of PIPOs (or streamed arrays, if #pragma HLS stream variable=chan is added):
void dut(int in[M], int out[M]) {
#pragma HLS dataflow
    int chan[N+1][M]; // partitioned into N+1 arrays of M elements
#pragma HLS array_partition complete dim=1 variable=chan
    read_in(in, chan[0]);
    for (int i=0; i<N; i++) {
#pragma HLS unroll
#pragma HLS dataflow
        worker(chan[i], chan[i+1]);
    }
    write_out(chan[N], out);
}
Finally, this example uses partial partitioning and partial unrolling; after both transformations, the resulting dataflow network still satisfies the single-producer single-consumer requirements:
void accum(int v, int &sum) {
    sum += v;
}
void dut(int v[10], int sum[10]) {
#pragma HLS array_partition cyclic factor=2 variable=v
#pragma HLS array_partition cyclic factor=2 variable=sum
    for (int i=0; i<10; i++) {
#pragma HLS dataflow
#pragma HLS unroll factor=2
        accum(v[i], sum[i]);
    }
}
Using single-producer single-consumer static variables for loop-carried dependences
The rule that there cannot be loop-carried dependences in a dataflow region is relaxed when a static scalar variable has a single producer process and a single consumer process.
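A minimal sketch of this pattern (the process names, SIZE value, and data path are illustrative, not taken from a real design): the static scalar carry is written only by update() (single producer) and read only by apply() (single consumer), so each execution of the region consumes the value produced by the previous one.

```cpp
#include <cassert>

#define SIZE 4

// Single consumer: the only process that reads `carry`.
void apply(const int in[SIZE], int out[SIZE], const int &carry) {
    for (int i = 0; i < SIZE; ++i)
        out[i] = in[i] + carry; // uses the value produced by the previous call
}

// Single producer: the only process that writes `carry`.
void update(const int in[SIZE], int &carry) {
    int sum = 0;
    for (int i = 0; i < SIZE; ++i)
        sum += in[i];
    carry = sum; // loop-carried into the next execution of the region
}

void dut(const int in[SIZE], int out[SIZE]) {
#pragma HLS dataflow
    static int carry = 0; // single producer (update), single consumer (apply)
    apply(in, out, carry);
    update(in, carry);
}
```

On the first call carry is still 0, so out mirrors in; on the second call each element is offset by the sum produced the first time, which is exactly the loop-carried behavior the relaxed rule permits.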
Streams to implement feedback and loop-carried dependences
As discussed above, hls::streams and hls::stream_of_blocks can be used to transfer data backwards (to processes that are lexically earlier) and implement loop-carried dependences under user control.
- Care must be taken to ensure that processes that read data from these feedback streams do not attempt to read from them until some data has been produced by later processes.
- The most common way to satisfy this requirement is to use a variable to skip the reading on the first execution. This can be:
- Either a static variable in the process function. Great care must be taken that there is only one instance of the process: with multiple instances, the single copy of the static variable would create a hidden communication channel between them. For example:
// consumes data
void read(hls::stream<int> &data, int &sum) {
    static bool do_read = false;
    sum = 0;
    if (do_read) {
        for (int i=0; i<SIZE; ++i)
            sum += data.read();
    }
    do_read = true;
}
// produces data
void write(hls::stream<int> &data, int input[SIZE]) {
    for (int i=0; i<SIZE; ++i)
        data.write(input[i]);
}
void test(int input[SIZE], int &sum) {
#pragma HLS dataflow
    hls::stream<int> data;
    read(data, sum); // consumes data, produces sum
    write(data, input); // consumes input, produces data
}
- An hls_thread_local variable in the hls::task, with the same use (hls_thread_local is better in the hls::task context because it ensures no hidden communication between multiple instances of the hls::task).
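For instance, the read process above could be recast as an hls::task worker with the flag declared hls_thread_local instead of static. This is only a sketch of that variant (it assumes the Vitis HLS headers and only builds under that toolchain; since hls::task workers communicate through streams, the scalar sum output becomes a stream here):

```cpp
// consumes data; safe to instantiate as several hls::tasks, because
// each instance gets its own copy of do_read
void read(hls::stream<int> &data, hls::stream<int> &sum) {
    hls_thread_local bool do_read = false;
    int s = 0;
    if (do_read) {
        for (int i=0; i<SIZE; ++i)
            s += data.read();
    }
    do_read = true;
    sum.write(s);
}
```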
- Using non-blocking reads, making sure that:
- The consumer of the stream is a dataflow function, not an hls::task
- If it does not see any input data, then it immediately returns, as in the following code:
// consumes data
void read(hls::stream<int> &data, int &sum) {
    if (data.empty()) return;
    sum = 0;
    for (int i=0; i<SIZE; ++i)
        sum += data.read();
}
// produces data
void write(hls::stream<int> &data, int input[SIZE]) {
    for (int i=0; i<SIZE; ++i)
        data.write(input[i]);
}
void test(int input[SIZE], int &sum) {
#pragma HLS dataflow
    hls::stream<int> data;
    read(data, sum); // consumes data, produces sum
    write(data, input); // consumes input, produces data
}