The following figure shows an example of a single-path pipeline with three PEs:
AccLoad, AccMul, and AccStore. The AccLoad and AccStore PEs access data stored
in the global memory through M_AXI channels. Note that the accelerator class
header ties the inputData and outputData ports to DDR[0]. In this case, the
ZERO_COPY declaration is used for both ports so that these PEs access the
global memory directly.
The following is the code example for the figure.

typedef vpp::stream<DT> STREAM;

class Acc : public VPP_ACC<Acc, NCU>
{
    // Let the PEs access these buffers directly in global memory.
    ZERO_COPY(inputData);
    ZERO_COPY(outputData);

    // Tie both ports to DDR[0].
    SYS_PORT(inputData, DDR[0]);
    SYS_PORT(outputData, DDR[0]);
public:
    // compute() describes the pipeline; the other functions are its PEs.
    static void compute(DT* inputData, DT* outputData);
    static void AccLoad(DT* inputData, STREAM& aStr,
                        STREAM& bStr, STREAM& iStr);
    static void AccMul(STREAM& aStr, STREAM& bStr,
                       STREAM& cStr);
    static void AccStore(STREAM& iStr, STREAM& cStr,
                         DT* outputData);
};

void Acc::compute(DT* inputData, DT* outputData)
{
    // Streams connecting the three PEs.
    static STREAM aStr, bStr, cStr, iStr;

    AccLoad (inputData, aStr, bStr, iStr);
    AccMul  (aStr, bStr, cStr);
    AccStore(iStr, cStr, outputData);
}

void Acc::AccMul(STREAM& aStr, STREAM& bStr, STREAM& cStr)
{
    // Multiply one word from each input stream per iteration.
    for (int i = 0; i < N_WORDS; i++) {
        int res = aStr.read() * bStr.read();
        cStr.write(res);
    }
}

The compute() function body represents a hardware pipeline. There are three
function calls corresponding to the PEs, and there are four local stream
variables declared:
- AccLoad takes inputData and writes to three streams.
- AccMul processes a fixed number of words in input streams aStr and bStr and
  writes the results to cStr.
- The AccStore function further processes the incoming data in iStr and cStr
  to write results on outputData, which is connected to the DDR[0] port.
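
The listing above does not show the bodies of AccLoad and AccStore. The
following is a minimal host-only sketch of the data flow through the three PEs,
written in plain C++ so that it can be compiled without the tools. Everything
beyond the function and stream names is an assumption made for illustration:
ModelStream is a hypothetical stand-in for vpp::stream<DT>, DT is taken to be
int, N_WORDS is taken to be 4, inputData is assumed to hold three consecutive
arrays of N_WORDS words, and AccStore is assumed to add the iStr word to the
product. The actual design may lay out and combine the data differently.

```cpp
#include <cassert>
#include <queue>

// Assumptions for this sketch only (not taken from the original example).
typedef int DT;
const int N_WORDS = 4;

// Hypothetical software stand-in for vpp::stream<DT>: an unbounded FIFO.
class ModelStream {
public:
    void write(DT v) { fifo_.push(v); }
    DT read() {
        assert(!fifo_.empty());  // a hardware stream would block instead
        DT v = fifo_.front();
        fifo_.pop();
        return v;
    }
private:
    std::queue<DT> fifo_;
};
typedef ModelStream STREAM;

// Assumed layout: inputData = [a words | b words | i words].
void AccLoad(const DT* inputData, STREAM& aStr, STREAM& bStr, STREAM& iStr) {
    for (int i = 0; i < N_WORDS; i++) {
        aStr.write(inputData[i]);
        bStr.write(inputData[N_WORDS + i]);
        iStr.write(inputData[2 * N_WORDS + i]);
    }
}

// Same behavior as Acc::AccMul in the example: one product per word.
void AccMul(STREAM& aStr, STREAM& bStr, STREAM& cStr) {
    for (int i = 0; i < N_WORDS; i++) {
        DT a = aStr.read();
        DT b = bStr.read();
        cStr.write(a * b);
    }
}

// Assumed combination: add the pass-through word to the product.
void AccStore(STREAM& iStr, STREAM& cStr, DT* outputData) {
    for (int i = 0; i < N_WORDS; i++) {
        outputData[i] = iStr.read() + cStr.read();
    }
}

// Software model of compute(): each PE runs exactly once per call.
void modelCompute(const DT* inputData, DT* outputData) {
    STREAM aStr, bStr, cStr, iStr;
    AccLoad(inputData, aStr, bStr, iStr);
    AccMul(aStr, bStr, cStr);
    AccStore(iStr, cStr, outputData);
}
```

In this model the PEs run one after another, which works only because the
queues are unbounded. In hardware the three PEs run concurrently and the
streams have finite depth.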
The PEs in this system execute synchronously, so data flows through them in a
pipelined fashion. Every call of compute() loads inputData and triggers all PEs
for a new transaction, and every PE must complete execution (start and stop)
exactly once per call. This example is a three-stage pipeline, that is, three
PEs chained in a single path. Thus, with a simple C++ coding style, the user can
create a hardware pipeline.
Note that the VPP_ACC class allows replication of such a pipeline through the
NCU parameter. If NCU is greater than 1, the hardware contains NCU replicated
pipelines. Calls to compute() are automatically dispatched to the next
available pipeline slot. Thus, the application layer remains simple and easy to
maintain, while the work of running data through multiple pipelines in the
hardware is automated.
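
To make the idea of the next available pipeline slot concrete, the following is
a conceptual sketch only, and does not show how the runtime is implemented. The
function dispatchCalls and its parameters are hypothetical: it assumes all
compute() calls are queued up front, that each call has a known duration in
arbitrary time units, and that a call goes to whichever of the NCU pipelines
becomes free first (lowest index on ties).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: returns, for each queued compute() call, the index of
// the pipeline copy it would run on under an earliest-free policy.
std::vector<int> dispatchCalls(int ncu, const std::vector<int>& durations) {
    assert(ncu > 0);
    std::vector<long> freeAt(ncu, 0);  // time at which each pipeline is free
    std::vector<int> slots;
    slots.reserve(durations.size());

    for (std::size_t k = 0; k < durations.size(); k++) {
        // Pick the pipeline that becomes available first.
        int best = 0;
        for (int cu = 1; cu < ncu; cu++) {
            if (freeAt[cu] < freeAt[best]) {
                best = cu;
            }
        }
        freeAt[best] += durations[k];
        slots.push_back(best);
    }
    return slots;
}
```

With NCU set to 1, every call lands on pipeline 0. With more pipelines, a long
call on one of them does not hold up shorter calls that can run on the others,
and the calling code does not change.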