The VSC mode allows compilation of accelerators with CUs that contain user-defined hardware pipelines, as described in Building Hardware. Such a pipeline is composed of PEs that connect to each other through AXI4-Stream. The PEs can also connect to platform ports that are AXI4 connections, such as global memory, or to IO interfaces such as an ethernet QSFP port. The platform provides IP that translates such interfaces into AXI4-Stream ports, which can then be connected to PEs in the user-defined pipeline.
Using VSC, such hardware pipelines can easily be configured to change their processing behavior dynamically at runtime from an application running on the host CPU. The following describes how such an accelerator can be created. An example system composition is shown in the following figure.
The Eth_Rx and Eth_Tx modules are typically platform IP that translate between ethernet packets and AXI4-Stream words. They can also be custom IP with user-defined AXI4 interfaces. The rest of the accelerator pipeline, shown in the white box, is created with VSC using AXI4-Stream connections. The PEs in the pipeline implement user-defined functionality, for example packet processing such as an internet protocol packet filter. In this example, the pipeline is created with two tasks, the PEs called mod and smp. Additionally, the system contains a control PE that has AXI4-Stream connections to these pipeline PEs. Example accelerator code is provided below, with the .hpp file on the left and the .cpp file on the right.
In this example, five PEs are defined, including eth_tx and eth_rx, which mock the platform IP behavior of receiving and transmitting words on AXI4-Stream. The compute() scope implements the accelerator pipeline using AXI4-Stream connections between these PEs. The control PE can send command words on these streams, and the task PEs (fsk_mod and fsk_smp) monitor these command streams and react by changing their behavior. The fsk_smp PE reacts by sending a requested number of sampled packets back to the control PE. The fsk_mod PE reacts by adding a value to the packet data or by dropping the packets that are passed from Eth_Rx to Eth_Tx.
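The command-stream mechanism can be illustrated with a small software model of a fsk_mod-like PE. This is a plain C++ sketch, not VSC or HLS code: it assumes that AXI4-Stream connections can be modeled as std::deque queues, and the Pkt layout (only its nr and dt fields appear in the host code), the ModCmd command word, and the function model_fsk_mod are hypothetical. A command word switches the PE between forwarding packets with an added value and dropping them. Unlike the hardware PE, the model returns once its input queue is empty.

```cpp
#include <cassert>
#include <deque>

// Hypothetical packet layout; the host code only shows the fields nr and dt.
struct Pkt {
    int nr;  // field the host reads after sampling
    int dt;  // packet data word
};

// Hypothetical command word for the modifier PE.
struct ModCmd {
    bool drop;  // true: drop packets; false: forward them
    int value;  // value added to dt when forwarding
};

// Software model of a fsk_mod-like PE (not VSC code). AXI4-Stream ports are
// modeled as std::deque. Before each data word, a pending command word is
// consumed and updates `state`, which then applies to the following packets.
void model_fsk_mod(std::deque<ModCmd>& cmd_in, std::deque<Pkt>& rx_in,
                   std::deque<Pkt>& tx_out, ModCmd& state) {
    while (!rx_in.empty()) {
        if (!cmd_in.empty()) {      // react to a command from the control PE
            state = cmd_in.front();
            cmd_in.pop_front();
        }
        Pkt p = rx_in.front();      // next word from the Eth_Rx side
        rx_in.pop_front();
        if (state.drop) continue;   // drop mode: do not forward
        p.dt += state.value;        // add mode: modify the data, then forward
        tx_out.push_back(p);        // toward the Eth_Tx side
    }
}
```

For example, queuing ModCmd{false, 10} ahead of three packets forwards them with 10 added to dt, while a later ModCmd{true, 0} drops every packet that follows it.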
The pipeline PEs, fsk_mod and fsk_smp, are FREE_RUNNING, as described in Guidance Macros, because they are never-ending PEs whose operation is driven by the words arriving on their input streams.
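A free-running PE has no call/return boundary: it keeps consuming words whenever they arrive. The sketch below models an fsk_smp-like sampler in the same hypothetical software style, with std::deque standing in for the streams and SamplerModel as an illustrative name. Each call to step() processes at most one data word, so a test can drive and stop what would otherwise be a never-ending loop. Treating the command word as a packet count, and forwarding every packet downstream while copying the requested number back to the control PE, are assumptions about this example's behavior.

```cpp
#include <cassert>
#include <deque>

struct Pkt { int nr; int dt; };  // hypothetical layout (host code reads nr and dt)

// Software model of a free-running, fsk_smp-like sampler (not VSC code).
// The hardware PE loops forever; here each call to step() handles at most one
// data word so that a test can drive the model and stop it.
class SamplerModel {
public:
    // Returns false when no data word is available (the hardware PE would stall).
    bool step(std::deque<int>& cmd_in, std::deque<Pkt>& data_in,
              std::deque<Pkt>& data_out, std::deque<Pkt>& sample_out) {
        if (!cmd_in.empty()) {            // new sample request from the control PE
            assert(cmd_in.front() >= 0);
            remaining_ = cmd_in.front();  // a new request replaces any pending one
            cmd_in.pop_front();
        }
        if (data_in.empty()) return false;
        Pkt p = data_in.front();
        data_in.pop_front();
        if (remaining_ > 0) {             // copy this packet back to the control PE
            sample_out.push_back(p);
            --remaining_;
        }
        data_out.push_back(p);            // assumption: every packet is also forwarded
        return true;
    }

private:
    int remaining_ = 0;  // packets still to sample for the current request
};
```

Keeping the per-request counter inside the PE is what lets the sampler react to commands without ever returning control to a caller.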
The control PE talks to the host CPU through two SYS_PORT connections, one for the input data pointer argument (dIn) and one for the output data pointer argument (dOut), as well as through the scalar command argument (cmd). The control PE is not free-running; it reacts to compute() calls from the host CPU. This system composition is entirely user-defined, including the nature of the commands and the corresponding PE functionality.
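One compute() call on the control PE can be sketched as a command decoder. The enum values other than cmd_sample, the ModCmd word, the use of dIn[0].dt as the configuration value (mirroring config[0].dt = sz in the host code), and the function model_control are all hypothetical, and the streams are again modeled as std::deque. Because this model is single-threaded, it only copies sampled packets that are already queued, whereas the hardware PE would wait on its stream.

```cpp
#include <cassert>
#include <deque>

struct Pkt { int nr; int dt; };          // hypothetical layout
struct ModCmd { bool drop; int value; }; // hypothetical command word for fsk_mod

// Hypothetical command encoding; only cmd_sample appears in the host code.
enum Cmd { cmd_sample, cmd_add, cmd_drop };

// Software model of one compute() call on a control-like PE (not VSC code):
// decode the scalar command, then emit a configuration word on the matching
// command stream. For a sample request, copy the sampled packets returned by
// the sampler into dOut. Returns the number of packets written to dOut.
int model_control(Cmd cmd, const Pkt* dIn, Pkt* dOut,
                  std::deque<ModCmd>& mod_cmd, std::deque<int>& smp_cmd,
                  std::deque<Pkt>& smp_return) {
    switch (cmd) {
    case cmd_add:                        // dIn[0].dt holds the value to add
        mod_cmd.push_back(ModCmd{false, dIn[0].dt});
        return 0;
    case cmd_drop:
        mod_cmd.push_back(ModCmd{true, 0});
        return 0;
    case cmd_sample: {
        int sz = dIn[0].dt;              // the host stores the sample size in dt
        assert(sz >= 0);
        smp_cmd.push_back(sz);
        // Hardware would block on the return stream until sz packets arrive;
        // this single-threaded model copies only the words already queued.
        int n = 0;
        while (n < sz && !smp_return.empty()) {
            dOut[n++] = smp_return.front();
            smp_return.pop_front();
        }
        return n;
    }
    }
    return 0;  // unreachable for valid commands
}
```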
The host code snippet is shown here; the entire example is available on GitHub.
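```cpp
// -- file: host.cpp --
#include <cstdio>            // printf
#include "vpp_acc_core.hpp"  // required: provides compute_async()
#include "ETH.hpp"           // accelerator class ETH

// Pkt, cmd_sample and print_sample() come from the full example (not shown here).
int config_sample(int sz)
{
    printf("main: sample %d\n", sz);
    // Output buffer for sz sampled packets; input buffer for one config word.
    Pkt* sample = (Pkt*)ETH::alloc_buf(sz * sizeof(Pkt), vpp::output);
    Pkt* config = (Pkt*)ETH::alloc_buf(sizeof(Pkt), vpp::input);
    config[0].dt = sz;  // pass the requested sample size to the control PE
    // Trigger the control PE asynchronously, then block until the job is done.
    auto fut = ETH::compute_async(cmd_sample, config, sample);
    fut.get();
    print_sample(sample, sz);
    int pkt_nr = sample[0].nr;  // nr field of the first sampled packet
    ETH::free_buf(config);
    ETH::free_buf(sample);
    return pkt_nr;
}
```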
The job commands issued by the VSC host code through the compute_async() API enable the control PE to translate each command and, in turn, pass configuration words to the pipeline PEs through the command streams. This snippet shows a user-defined API that issues a packet sampling command. A sample buffer of the requested size sz is allocated at runtime, and the compute_async() call triggers the control PE to capture sz packets and return their words to the host. As its name indicates, compute_async() is an asynchronous call: it triggers the accelerator and returns fut without waiting for the job to finish. The host code then blocks in fut.get() until the results are available. Once the sample words have been returned and processed by the host, the corresponding buffers can be freed.
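The host-side pattern (launch a job asynchronously, then block on the returned future) mirrors what standard C++ offers with std::async and std::future. The sketch below uses that standard-library pair only as an analog of compute_async() and fut.get(); it does not call the VSC runtime, and fake_accelerator_sample and config_sample_model are hypothetical stand-ins. On older GCC/glibc toolchains, std::launch::async may require linking with -pthread.

```cpp
#include <cassert>
#include <future>
#include <vector>

struct Pkt { int nr; int dt; };  // hypothetical layout

// Hypothetical stand-in for the accelerator: pretends to capture sz packets.
// This is NOT the VSC runtime; it only mimics what the host observes.
std::vector<Pkt> fake_accelerator_sample(int sz) {
    std::vector<Pkt> out(sz);
    for (int i = 0; i < sz; ++i) out[i] = Pkt{100 + i, i};
    return out;
}

// Same shape as config_sample(): launch asynchronously, optionally do other
// work, then block on the future until the results exist.
int config_sample_model(int sz) {
    std::future<std::vector<Pkt>> fut =
        std::async(std::launch::async, fake_accelerator_sample, sz);
    // ... the host thread is free to do other work here ...
    std::vector<Pkt> sample = fut.get();  // blocks until the job completes
    return sample.empty() ? -1 : sample[0].nr;
}
```

Separating the launch from the wait is what lets an occasional control job coexist with other host work without a dedicated streaming thread.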
Because host control in this case is not a continuous pipeline of compute jobs but only an occasional, non-timing-critical job, the send_while/receive_all threads are not used to manage it. Instead, synchronization is application-managed using the compute_async() API defined in vpp_acc_core.hpp.
For this reason, the host code includes vpp_acc_core.hpp instead of vpp_acc.hpp.

With VSC, such hardware pipelines can be composed and controlled asynchronously from a host CPU. One application of such an accelerator is packet processing on a NIC. For example, the X3 Hybrid platform provides ethernet transmission and reception IPs that convert between ethernet packets at the QSFP ports and AXI4-Stream through a MAC interface. Furthermore, a NIC interface allows the accelerator to provide data to a connected host CPU over PCIe, or Host-Memory access can be used for direct access to host CPU memory over PCIe. Using VSC, the packet processing pipeline of the accelerator can be composed in the PL and controlled asynchronously by a CPU over PCIe.