The following figure shows an example of a single-path pipeline that has three PEs: AccLoad, AccMul, and AccStore. The AccLoad and AccStore PEs access data stored in the global memory through M_AXI channels. The accelerator class header ties the inputData and outputData ports to DDR[0]. In this case, ZERO_COPY is used so that these PEs access the global memory directly.
The following is the code example for the figure.
typedef vpp::stream<DT> STREAM;

class Acc : public VPP_ACC<Acc, NCU>
{
    // Let the PEs access the global memory directly
    ZERO_COPY(inputData);
    ZERO_COPY(outputData);

    // Tie both ports to DDR[0]
    SYS_PORT(inputData, DDR[0]);
    SYS_PORT(outputData, DDR[0]);

public:
    static void compute(DT* inputData, DT* outputData);

    static void AccLoad(DT* inputData, STREAM& aStr,
                        STREAM& bStr, STREAM& iStr);
    static void AccMul(STREAM& aStr, STREAM& bStr,
                       STREAM& cStr);
    static void AccStore(STREAM& iStr, STREAM& cStr,
                         DT* outputData);
};
void Acc::compute(DT* inputData, DT* outputData)
{
    static STREAM aStr, bStr, cStr, iStr;

    // One call per PE, connected through the local streams
    AccLoad (inputData, aStr, bStr, iStr);
    AccMul  (aStr, bStr, cStr);
    AccStore(iStr, cStr, outputData);
}
void Acc::AccMul(STREAM& aStr, STREAM& bStr, STREAM& cStr)
{
    // Multiply one word from each input stream and forward the product
    for (int i = 0; i < N_WORDS; i++) {
        int res = aStr.read() * bStr.read();
        cStr.write(res);
    }
}
The compute() function body represents a hardware pipeline. It contains three function calls, one for each PE, and declares four local stream variables:
- AccLoad takes inputData and writes to three streams.
- AccMul processes a fixed number of words from the input streams aStr and bStr and writes the results to cStr.
- AccStore further processes the incoming data in iStr and cStr and writes the results to outputData, which is connected to the DDR[0] port.
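The data flow described above can be sketched as a plain C++ software model that runs without the accelerator toolchain. This is only an illustration under stated assumptions: ModelStream is a hypothetical queue-backed stand-in for vpp::stream, DT is assumed to be int, N_WORDS is assumed to be 4, and the bodies of AccLoad and AccStore (interleaved input pairs, with the word index forwarded on iStr) are invented for the example, since the original listing only shows AccMul.

```cpp
#include <cassert>
#include <queue>

using DT = int;              // assumption: the real DT is not shown in the source
constexpr int N_WORDS = 4;   // assumption: words processed per compute() call

// Hypothetical queue-backed stand-in for vpp::stream<DT> (software model only).
struct ModelStream {
    std::queue<DT> q;
    void write(DT v) { q.push(v); }
    DT read() {
        assert(!q.empty());  // a hardware stream would block here instead
        DT v = q.front();
        q.pop();
        return v;
    }
};

// Assumed behavior: input holds N_WORDS interleaved (a, b) pairs.
void AccLoad(const DT* inputData, ModelStream& aStr,
             ModelStream& bStr, ModelStream& iStr) {
    for (int i = 0; i < N_WORDS; i++) {
        aStr.write(inputData[2 * i]);
        bStr.write(inputData[2 * i + 1]);
        iStr.write(i);       // assumed: forward the word index to AccStore
    }
}

// Same logic as the AccMul PE in the listing.
void AccMul(ModelStream& aStr, ModelStream& bStr, ModelStream& cStr) {
    for (int i = 0; i < N_WORDS; i++) {
        DT res = aStr.read() * bStr.read();
        cStr.write(res);
    }
}

// Assumed behavior: store each product at the index carried by iStr.
void AccStore(ModelStream& iStr, ModelStream& cStr, DT* outputData) {
    for (int i = 0; i < N_WORDS; i++) {
        DT idx = iStr.read();
        outputData[idx] = cStr.read();
    }
}

// One transaction: every PE runs exactly once, chained through four streams.
void compute(const DT* inputData, DT* outputData) {
    ModelStream aStr, bStr, cStr, iStr;
    AccLoad(inputData, aStr, bStr, iStr);
    AccMul(aStr, bStr, cStr);
    AccStore(iStr, cStr, outputData);
}
```

In this model the three calls run one after another, whereas in hardware the PEs are separate units connected by the streams. The model is only useful for checking the functional result of one transaction.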
The PEs in this system execute synchronously, so that data flows through them in a pipelined fashion. Every call to compute() loads inputData and triggers all PEs for a new transaction, and requires every PE to complete execution (start and stop) exactly once. This example is a three-stage pipeline, that is, three PEs chained in a single path. Thus, the user can create a hardware pipeline with a simple C++ coding style.
The VPP_ACC class allows replication of such a pipeline using the NCU parameter. If NCU is greater than 1, the hardware contains NCU replicated pipelines. Calls to compute() are automatically loaded into the next available pipeline slot. Thus, the application layer remains simple and easy to maintain, while running data through multiple pipelines in the hardware is automated.
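The idea of loading each compute() call into the next available pipeline can be illustrated with a small software model. This sketch is not the VPP_ACC runtime and does not claim to reproduce its scheduling policy. The function dispatchCalls, the notion of a per-call duration, and the lowest-index tie-break are all hypothetical, chosen only to show how calls might spread across NCU pipelines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: assign each compute() call to the pipeline that
// becomes free first. Returns the pipeline index chosen for every call.
std::vector<int> dispatchCalls(int ncu, const std::vector<int>& callDurations) {
    assert(ncu >= 1);
    // freeAt[k] is the model time at which pipeline k finishes its last call.
    std::vector<long> freeAt(static_cast<std::size_t>(ncu), 0L);
    std::vector<int> assignment;
    assignment.reserve(callDurations.size());

    for (int d : callDurations) {
        // Earliest-free pipeline; min_element returns the lowest index on ties.
        auto it = std::min_element(freeAt.begin(), freeAt.end());
        assignment.push_back(static_cast<int>(it - freeAt.begin()));
        *it += d;  // the chosen pipeline is busy for d more time units
    }
    return assignment;
}
```

For example, with two pipelines and call durations {3, 1, 1, 1}, the first call occupies pipeline 0 and the three short calls all land on pipeline 1, because it is free again before pipeline 0 finishes. With a single pipeline, every call goes to pipeline 0.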