The PEs in a pipeline operate synchronously on the transactions passing through: for every compute() call, a PE starts and stops exactly once. However, when a PE is marked as FREE_RUNNING, as described in Guidance Macros, it has the following hardware semantics:
- The PE does not start, stop, or reset per transaction or compute() call. It is an HLS kernel with the ap_ctrl_none control interface, as described in Block-Level Control Protocols (see the sketch after this list).
- The interface must have only AXI4-Stream arguments or scalar inputs.
- Operation is data-driven: the PE acts only on the incoming stream words and is unaware of the payload size of the transaction.
- The PE begins execution immediately after the hardware bitstream is programmed into the device.
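Although the system compiler applies the required interface protocols automatically, it can help to see what FREE_RUNNING implies at the HLS level. The following is an illustrative, standalone Vitis HLS sketch only; the function name fsk_incr_hls and the DT typedef are placeholders, and you do not write these pragmas in the system compiler flow.

// Illustrative only: roughly the HLS-level interface implied by FREE_RUNNING.
// The system compiler applies these protocols for you; the pragmas are shown
// here only to relate the macro to the ap_ctrl_none block-level protocol.
#include <hls_stream.h>
typedef int DT;   // placeholder data type for this sketch

void fsk_incr_hls(hls::stream<DT>& AS, hls::stream<DT>& XS)
{
#pragma HLS interface mode=ap_ctrl_none port=return   // no start/stop/reset handshake
#pragma HLS interface mode=axis port=AS               // AXI4-Stream input
#pragma HLS interface mode=axis port=XS               // AXI4-Stream output
    // Purely data-driven: consume one input word, produce one output word.
    DT val = AS.read();
    XS.write(val + 1);
}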
The figure above shows a diagram of the free-running PE. This accelerator contains two PEs: a ldst PE that has global memory access, and fsk_incr, which is a free-running PE. In the compute() scope these PEs are connected by two AXI4-Stream interfaces: AS, which moves words from ldst to fsk_incr, and XS, which is the feedback path. The code for this example is provided below.
class fsk_acc : public VPP_ACC<fsk_acc, NCU>
{
    ZERO_COPY(A);
    ZERO_COPY(X);
    SYS_PORT(A, DDR[0]);
    SYS_PORT(X, DDR[0]);
    FREE_RUNNING(fsk_incr);

  public:
    static void compute(DT* A, DT* X, int sz);
    static void ldst(DT* A, DT* X, int sz, hls::stream<DT>& AS,
                     hls::stream<DT>& XS);
    static void fsk_incr(hls::stream<DT>& AS, hls::stream<DT>& XS);
};
void fsk_acc::compute(DT* A, DT* X, int sz)
{
    static vpp::stream<DT> AS, XS;
    ldst(A, X, sz, AS, XS);
    fsk_incr(AS, XS);
}
void fsk_acc::ldst(DT* A, DT* X, int sz, hls::stream<DT>& AS,
                   hls::stream<DT>& XS)
{
    for (int i = 0; i < sz; i++) {
        AS.write(A[i]);   // stream sz words read from global memory port A
    }
    for (int i = 0; i < sz; i++) {
        XS.read(X[i]);    // collect sz result words into global memory port X
    }
}
void fsk_acc::fsk_incr(hls::stream<DT>& AS, hls::stream<DT>& XS)
{
    DT val;
    AS.read(val);        // blocking read of one input word
    XS.write(val + 1);   // write the incremented word to the feedback stream
}
The ldst PE operates on sz words, reading from the global memory port A and writing to the global memory port X. In contrast, the free-running PE fsk_incr is agnostic to sz and reacts only to the words arriving on the incoming AS stream.
The free-running semantics described above greatly simplify the implementation of a free-running PE, often reducing the FPGA logic and routing resources required. They enable the design of a streaming pipeline in which the intermediate PEs are free-running and operate only on their input AXI4-Streams, as illustrated by the sketch below.
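For example, in a longer pipeline any purely stream-to-stream stage can be marked free-running while the memory-facing PEs remain transaction-controlled. The following sketch is illustrative only and reuses the conventions of the example above; the class, PE, and stream names (pipe_acc, src, stage, sink, S1, S2) are hypothetical.

// Hypothetical three-PE pipeline: only the middle, stream-to-stream stage is
// free-running. Assumes the same DT, NCU, and header setup as the example above.
class pipe_acc : public VPP_ACC<pipe_acc, NCU>
{
    ZERO_COPY(A);
    ZERO_COPY(X);
    FREE_RUNNING(stage);

  public:
    static void compute(DT* A, DT* X, int sz);
    static void src(DT* A, int sz, hls::stream<DT>& S1);
    static void stage(hls::stream<DT>& S1, hls::stream<DT>& S2);
    static void sink(DT* X, int sz, hls::stream<DT>& S2);
};

void pipe_acc::compute(DT* A, DT* X, int sz)
{
    static vpp::stream<DT> S1, S2;
    src(A, sz, S1);      // reads sz words from global memory into S1
    stage(S1, S2);       // free-running: data-driven, unaware of sz
    sink(X, sz, S2);     // writes sz words from S2 back to global memory
}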
With any pipeline composition, when hardware replication is enabled (NCU greater than 1), the hardware contains that many replicated pipelines, and each compute() job runs on an available pipeline slot. Thus, the application layer remains simple, and the runtime automatically distributes the data across the multiple pipelines in the hardware.
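On the host side, the application only submits compute() jobs; the runtime dispatches each job to whichever pipeline replica is free. The following is a minimal host-side sketch, assuming the VPP_ACC class API (create_bufpool, alloc_buf, send_while, receive_all_in_order, join); the job count, payload size, and the exact alloc_buf signature used here are assumptions for illustration.

// Hypothetical host driver: queues several compute() jobs; with NCU > 1 the
// runtime runs them on any available replicated pipeline.
// Assumes DT and the fsk_acc class from the device code above are visible here.
int main()
{
    auto APool = fsk_acc::create_bufpool(vpp::input);    // pool for input A buffers
    auto XPool = fsk_acc::create_bufpool(vpp::output);   // pool for output X buffers
    const int sz = 1024;                                 // assumed payload size
    const int num_jobs = 16;                             // assumed number of jobs
    int sent = 0;

    fsk_acc::send_while([&]() {
        // alloc_buf arguments assumed to be (pool, size in bytes)
        DT* A = (DT*)fsk_acc::alloc_buf(APool, sz * sizeof(DT));
        DT* X = (DT*)fsk_acc::alloc_buf(XPool, sz * sizeof(DT));
        for (int i = 0; i < sz; i++) A[i] = (DT)i;       // fill the input payload
        fsk_acc::compute(A, X, sz);                      // queue one pipeline job
        return (++sent < num_jobs);                      // keep sending until done
    });

    fsk_acc::receive_all_in_order([&]() {
        // Per-job completion callback; the corresponding X buffer now holds
        // the incremented words.
    });

    fsk_acc::join();                                     // wait for all jobs to finish
    return 0;
}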