The Vitis HLS compiler can reorder unrelated accesses to improve performance. However, to ensure correct logic (that is, to preserve the program semantics), it maintains the sequential semantics of the program for both control dependences and data dependences.
When, in addition, you want to force an order between unrelated accesses, the Vitis HLS fence can be used to guarantee that two (or more) accesses to streams, arrays, and scalar function arguments passed by reference are scheduled in the sequential order implied by the source code.
To use the fence, include the header hls_fence.h in your source. An example of fence usage can be found in Vitis-HLS-Introductory-Examples/Modeling on GitHub.
Full and Half Fence
The library offers two different types of fences.
Full Fence
hls::fence(obj_1, obj_2, ..., obj_n);
The full fence constrains the scheduling of the accesses to obj_1, obj_2, ..., obj_n: any access that comes before (respectively, after) the fence in sequential order remains scheduled before (respectively, after) the fence.
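As a host-compilable illustration of the call shape, consider a full fence over two scalar arguments passed by reference. This is a sketch: the stub namespace below is an assumption that stands in for the real declaration in hls_fence.h, which generates no hardware and only constrains the scheduler, so a no-op preserves the program's behavior.

```cpp
#include <cassert>

// Host-side stub (assumption of this sketch): in Vitis HLS, hls::fence comes
// from hls_fence.h, synthesizes to no logic, and merely constrains the
// scheduler, so a no-op stand-in keeps the behavior identical.
namespace hls {
template <typename... Ts> void fence(Ts&...) {}
}

// Scalar arguments passed by reference are among the object kinds a fence
// can order, along with streams and arrays.
void order_writes(int &a, int &b) {
    a = 1;            // before the fence in source order: stays scheduled before it
    hls::fence(a, b); // full fence over both objects
    b = a + 1;        // after the fence in source order: stays scheduled after it
}
```

In hardware the fence itself costs nothing; it only restricts how the scheduler may reorder the two accesses relative to each other.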
Half Fence
The half fence is less restrictive and is useful in pipelined loops.
hls::fence({obj_1, obj_2, ..., obj_n}, {obj'_1, obj'_2, ..., obj'_p});
The fence above constrains the access schedule as follows: accesses to objects in the first set of curly braces may be scheduled earlier than the fence, but if they come before the fence in sequential order, they cannot be scheduled after it. Similarly, accesses to objects in the second set of curly braces may be scheduled later than the fence, but if they come after the fence in sequential order, they cannot be scheduled before it.
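The half-fence call shape can be sketched on the host as follows. The stub below is an assumption standing in for hls_fence.h (the fence produces no logic, so a no-op is behavior-preserving); with scalar ints, the braced sets are modeled as initializer lists.

```cpp
#include <cassert>
#include <initializer_list>

// Host-side stub (assumption of this sketch): the real half fence is declared
// in hls_fence.h and produces no hardware, so a no-op keeps behavior identical.
namespace hls {
inline void fence(std::initializer_list<int>, std::initializer_list<int>) {}
}

void half_fence_demo(int &a, int &b) {
    a = 1;                // first set: may finish early, but cannot be delayed past the fence
    hls::fence({a}, {b}); // half fence: {first set}, {second set}
    b = a + 1;            // second set: may start late, but cannot be hoisted before the fence
}
```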
The full fence hls::fence(a, b); is equivalent to the half fence hls::fence({a, b}, {a, b});.
Practical Examples
Here are two typical scenarios where hls::fence is useful.
Avoiding Deadlocks
Consider two communicating dataflow processes:
Producer(...) {
    int bound = ...;
    strm1.write(bound);
    for (int i = 0; i < bound; i++) {
        strm2.write(...);
    }
}

Consumer(...) {
    int bound = strm1.read();
    for (int i = 0; i < bound; i++) {
        ... = strm2.read();
    }
}
The second process (Consumer) must read the loop bound from stream strm1 before it can start iterating over reads of stream strm2. The first process (Producer) sends this loop bound before writing to strm2.
In this arrangement, both processes can overlap, and only a small FIFO is needed to store the elements of strm2. You would not expect the compiler and/or the scheduler to move the write to strm1 after the loop, yet nothing in the data dependences prevents it. If that happens, the consumer cannot start reading until all elements have been written to strm2, and a stream of depth "bound" is needed to avoid a deadlock.
Adding a fence guarantees that the write to strm1 happens before the writes to strm2, as follows:
Producer(...) {
    int bound = ...;
    strm1.write(bound);
    hls::fence({strm1}, {strm2});
    for (int i = 0; i < bound; i++) {
        strm2.write(...);
    }
}

Consumer(...) {
    int bound = strm1.read();
    for (int i = 0; i < bound; i++) {
        ... = strm2.read();
    }
}
Configuring the Vivado LogiCORE FFT
The initialization of the FFT configuration and the loop that sends the data can be written as follows:
void inputdatamover(... input_parameters, config_t *fft_config,
                    cmpxDataIn in[FFT_LENGTH], cmpxDataIn xn[FFT_LENGTH]) {
    config_t fft_config_tmp;
    fft_config_tmp.setDir(...);
    fft_config_tmp.setSch(...);
    *fft_config = fft_config_tmp;
    for (int i = 0; i < FFT_LENGTH; i++) {
#pragma HLS pipeline rewind
        xn[i] = in[i];
    }
}
For correctness, it is imperative that the FFT configuration is captured before the actual processing of the data.
As written above, there is no such guarantee, and a fence might be needed:
void inputdatamover(... input_parameters, config_t *fft_config,
                    cmpxDataIn in[FFT_LENGTH], cmpxDataIn xn[FFT_LENGTH]) {
    config_t fft_config_tmp;
    fft_config_tmp.setDir(...);
    fft_config_tmp.setSch(...);
    *fft_config = fft_config_tmp;
    hls::fence({fft_config}, {xn});
    for (int i = 0; i < FFT_LENGTH; i++) {
#pragma HLS pipeline rewind style=flp
        xn[i] = in[i];
    }
}
Known Limitations
By default, the fence is not a barrier over all variables: every variable whose accesses must be constrained has to be passed explicitly as an argument to the fence.
Fences can be applied in a scheduling region (pipelined or sequential) but not inside a dataflow region.