You can define a hardware accelerator derived from the VPP_ACC base class, and build the hardware and software interface with VSC. This section describes the software API provided by the VPP_ACC class.
Controlling the Accelerator
The VPP_ACC class API provides methods for scheduling jobs on the accelerator hardware, processing results, and other software controls at run time.
- send_while()
-
Executes the SendBody function repeatedly until it returns a boolean false. The pseudo-code of send_while() is similar to:
do { bool ret = f(); } while (ret == true);
void send_while(SendBody f, VPP_CC& cc = *s_default_cc);
Arguments:
- SendBody f: SendBody is a user-defined C++ lambda function. The lambda function captures variables from the enclosing scope. Notably, all used buffer pool variables need to be captured, and they should be captured by value. Using [=] will automatically capture those by value, as well as any other variable you might use inside the lambda function. Any variable which gets modified inside the lambda function needs to be passed by reference, for example [=, &m]. Passing variables by reference unnecessarily can result in degraded host code performance, but on the other hand, passing a large class object by value might lead to an unnecessary deep copy. The latter, however, is unlikely to be needed for the send (or receive) functions. Tip: Any variable which needs to be passed by reference can be captured explicitly with [=, &var]. See the capture sketch below.
- VPP_CC& cc: Optional argument used for grouping CUs. For example, it can be used to specify which multi-card cluster of CUs to use, as described in CU Cluster and Multi-Card Support.
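For illustration, a minimal sketch of a send loop; the derived class my_acc, its compute(in, out) signature, the buffer pools inBP/outBP, and the constants SZ and NUM_JOBS are assumptions, not part of this reference:

int count = 0;                       // modified in the lambda, so captured by reference
my_acc::send_while([=, &count]() {   // [=] captures the buffer pools by value
    int* in  = my_acc::alloc_buf<int>(inBP,  SZ);
    int* out = my_acc::alloc_buf<int>(outBP, SZ);
    // ... fill in[] with the data for this job ...
    my_acc::compute(in, out);        // schedule one job (non-blocking)
    return ++count < NUM_JOBS;       // returning false ends the loop
});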
- compute()
-
As described in The compute() API, the compute() method is a special user-defined method in the derived VPP_ACC accelerator class definition which is used to represent the CU and contain the processing elements (PEs).
void compute(Args ...);
- A call to the hardware accelerator schedules one job.
- One or multiple compute() calls can be made inside the SendBody function.
- Each compute() call is non-blocking and will return immediately, but will block when the task pipeline is full.
- In the background, a compute() call will make sure that all its inputs get transferred to the device and then executed on any available CU.
- Once all compute() calls of an iteration have finished, the output buffers are transferred back to the host and a receive_all iteration will be started for that iteration.
- The following conditions must be followed by the application code, and are asserted during software emulation (see the sketch after this list):
  - Once a compute() call has been made, input buffers and file buffers cannot be modified anymore, and no more calls to alloc_buf or file_buf can be made in that iteration.
  - Output buffers cannot be read or written until data is received in the corresponding receive iteration.
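A sketch of these conditions inside a SendBody, using the hypothetical my_acc class and pools from the earlier sketch:

my_acc::send_while([=]() {
    int* in  = my_acc::alloc_buf<int>(inBP,  SZ);
    int* out = my_acc::alloc_buf<int>(outBP, SZ);
    in[0] = 42;                            // OK: inputs may be written before compute()
    my_acc::compute(in, out);
    // in[0] = 0;                          // NOT allowed: inputs are frozen after compute()
    // my_acc::alloc_buf<int>(inBP, SZ);   // NOT allowed in this iteration anymore
    // int x = out[0];                     // NOT allowed: read out[] in the receive iteration
    return false;                          // single iteration for brevity
});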
- receive_all_xxx()
-
Executes a C++ lambda function repeatedly, either in order or ASAP, whenever a compute request completes, to receive data results from the hardware accelerator. Exits when send_while() has exited and all iterations have been received. The sketch after this entry shows the full send/receive/join pattern.
void receive_all_in_order(RecvBody f, VPP_CC& cc = *s_default_cc);
void receive_all_asap(RecvBody f, VPP_CC& cc = *s_default_cc);
Arguments:
- RecvBody f: RecvBody is a user-defined C++ lambda function. Refer to the explanation of lambda functions in send_while().
- VPP_CC& cc: Optional argument used for grouping CUs. For example, it can be used to specify which multi-card cluster of CUs to use, as described in CU Cluster and Multi-Card Support.
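A sketch of the complete pattern, pairing send_while() with receive_all_in_order() and join() (described below); my_acc, the pools, process(), SZ, and NUM_JOBS are assumptions:

int i = 0;
my_acc::send_while([=, &i]() {
    int* in  = my_acc::alloc_buf<int>(inBP,  SZ);
    int* out = my_acc::alloc_buf<int>(outBP, SZ);
    for (int k = 0; k < SZ; ++k) in[k] = i;      // job input
    my_acc::compute(in, out);
    return ++i < NUM_JOBS;
});
my_acc::receive_all_in_order([=]() {
    int* out = my_acc::get_buf<int>(outBP);      // buffer from the matching send iteration
    process(out, SZ);                            // hypothetical result consumer
});
my_acc::join();                                  // wait for both loops to finish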
- receive_one_xxx()
-
As described in Multi-Accelerator Pipeline Composition, this is used to receive one iteration of this accelerator inside the send_while() loop of another accelerator.
void receive_one_in_order(RecvBody f, VPP_CC& cc = *s_default_cc);
void receive_one_asap(RecvBody f, VPP_CC& cc = *s_default_cc);
Arguments:
- RecvBody f: RecvBody is a user-defined C++ lambda function. Refer to the explanation of lambda functions in send_while().
- VPP_CC& cc: Optional argument used for grouping CUs. For example, it can be used to specify which multi-card cluster of CUs to use, as described in CU Cluster and Multi-Card Support.
- join()
-
Waits for the send and receive loops to finish.
void join(VPP_CC& cc = *s_default_cc);
Arguments:
- VPP_CC& cc: Optional argument used for grouping CUs. For example, it can be used to specify which multi-card cluster of CUs to use, as described in CU Cluster and Multi-Card Support.
- set_ncu()
-
Sets the number of CUs the driver should use. Call this method before starting the send/receive loops to establish the number of CUs the compute() function should use.
void VPP_ACC::set_ncu(int ncu);
Arguments:
- int ncu: The specified number of CUs (ncu) must satisfy 1 <= ncu <= NCU, where NCU is the template parameter of VPP_ACC<..., NCU>, as described in User-Defined Accelerator Class.
- get_ncu()
-
Returns the number of CUs currently used by the driver, as previously set by VPP_ACC::set_ncu(), or, if not modified, the NCU template parameter of VPP_ACC<..., NCU>.
int VPP_ACC::get_ncu();
- get_NCU()
-
Returns the number of CUs implemented in hardware (that is, the value of the NCU template parameter provided when building the hardware, and specified in the base class VPP_ACC<..., NCU>).
int VPP_ACC::get_NCU();
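A small sketch of how these three methods fit together (my_acc assumed as before):

int hw_cus = my_acc::get_NCU();                // CUs implemented in hardware (NCU)
my_acc::set_ncu(hw_cus > 1 ? hw_cus / 2 : 1);  // restrict the driver to half of them
int used = my_acc::get_ncu();                  // reports the value just set
// ... create buffer pools, then start the send/receive loops ...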
Setup I/O for Computation
The API methods described here are used to set up input and output buffers for the hardware accelerator.
- create_bufpool()
-
Creates and returns an opaque class object to be used in other methods that require a buffer handle, such as alloc_buf(). Use before starting the send/receive loops.
VPP_BP VPP_ACC::create_bufpool(vpp::Mode m, vpp::FileXFer = vpp::none);
Arguments:
- vpp::Mode m: Can specify any of the following values to denote the data transfer type for each compute() argument:
  - vpp::input: data transfers into the accelerator
  - vpp::output: data transfers out of the accelerator
  - vpp::bidirectional: data transfers into and out of the accelerator
  - vpp::remote: data is resident only on the device memory connected to the accelerator, and is not sent or received by the host code
- vpp::FileXFer = vpp::none: Can specify any of the following values to indicate the location of a file for data transfer:
  - vpp::p2p: the file is transferred over the P2P bridge to the accelerator. This works only on platforms that support the P2P feature, for example the U2 card with a connected SmartSSD.
  - vpp::h2c: the file is transferred from a host CPU (connected file server) to the card over PCIe. This is standard for most Alveo cards connected to a host CPU over PCIe.
  - vpp::none: uses regular buffer objects, not supporting a file transfer. This is the default value.
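A sketch of typical pool creation before the loops start (my_acc assumed; the pool names are placeholders):

VPP_BP inBP   = my_acc::create_bufpool(vpp::input);            // host-to-device data
VPP_BP outBP  = my_acc::create_bufpool(vpp::output);           // device-to-host data
VPP_BP fileBP = my_acc::create_bufpool(vpp::input, vpp::p2p);  // file data over P2P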
- alloc_buf()
-
Returns a pointer to the buffer object. Use inside the send thread's lambda function.
void* VPP_ACC::alloc_buf(VPP_BP bp, int byte_sz);
T* VPP_ACC::alloc_buf<T>(VPP_BP bp, int numT);
Important: The buffer gets allocated from the given buffer pool. The lifetime of the buffer lasts until the end of the matching receive iteration, at which point the buffer is automatically returned to the buffer pool.
Arguments:
- VPP_BP bp: A buffer pool object returned by create_bufpool().
- int byte_sz: Specifies the number of bytes for the requested buffer.
- int numT: Specifies the number of elements for the requested <T> array buffer.
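For example, inside the SendBody lambda (pools as above), the typed form is a convenience over the untyped byte-sized form:

float* in  = my_acc::alloc_buf<float>(inBP, 1024);            // 1024 floats
void*  raw = my_acc::alloc_buf(outBP, 1024 * sizeof(float));  // same size, in bytes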
- file_buf()
-
This method maps a given file, or part of the file, to a buffer from the specified buffer pool object. Use inside the send thread's lambda function. The file_buf() method can be called multiple times to map multiple files (or file segments) to different locations in a single buffer. The method returns a pointer to the buffer object, which is just a host handle; it cannot be used by the host to read or write the buffer.
void* VPP_ACC::file_buf(VPP_BP bp, int fd, int byte_sz, off_t fd_byte_offset=0, off_t buf_byte_offset=0);
T* VPP_ACC::file_buf<T>(VPP_BP bp, int fd, int numT, off_t fd_T_index=0, off_t buf_T_index=0);
Arguments:
- VPP_BP bp: A buffer pool object returned by create_bufpool().
- int fd: The file descriptor to read from or write to (or 0 when using custom_sync_outputs() as described below). In P2P mode the file must have been opened with the O_DIRECT flag.
- int byte_sz: Specifies the number of bytes for the requested buffer. In P2P mode, this must align to the file system block size (4 kB).
- fd_byte_offset: Offset in the file to read from/write to.
- buf_byte_offset: Offset in the buffer to write to/read from.
- int numT: Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
- fd_T_index: The array index in the file to start reading from/writing to.
- buf_T_index: The buffer index to start writing to/reading from.
Additional notes:
- The statement T* buf = file_buf<T>(bp, fd, num, fd_idx, buf_idx); is the same as T* buf = (T*)file_buf(bp, fd, num*sizeof(T), fd_idx*sizeof(T), buf_idx*sizeof(T));
- The actual size of the buffer will be adjusted as required. As a result, the buffer returned by the last call needs to be used in the compute() call(s). Once used in a compute call, no more mappings can be added in that iteration.
- See file_filter_sc under Startup Example in Supported Platforms and Startup Examples.
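A sketch mapping a whole file into an input buffer for one iteration (fileBP from the create_bufpool() sketch; the path, FILE_SZ, and out are assumptions):

int fd = open("/data/block.bin", O_RDONLY | O_DIRECT);   // requires <fcntl.h>; O_DIRECT for P2P
char* in = my_acc::file_buf<char>(fileBP, fd, FILE_SZ);  // FILE_SZ: 4 kB aligned for P2P
my_acc::compute(in, out);                                // use the last returned pointer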
- get_buf()
-
Use inside the receive loop associated with a matching send loop. This returns the buffer object which was allocated in the matching send iteration.
void* VPP_ACC::get_buf(VPP_BP bp);
T* VPP_ACC::get_buf<T>(VPP_BP bp);
Arguments:
- VPP_BP bp: A buffer pool object returned by create_bufpool().
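For example, inside the RecvBody lambda:

my_acc::receive_all_asap([=]() {
    int* out = my_acc::get_buf<int>(outBP);  // output of the matching send iteration
    // ... consume out[]; the buffer returns to its pool when this iteration ends ...
});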
- transfer_buf()
-
This method is to be used in multi-accelerator composition, as described in Multi-Accelerator Pipeline Composition, to transfer ownership of a buffer from one accelerator to another, using a receive_one_xxx() method inside the send_while() of that other accelerator. This extends the lifetime of the buffer until the end of the receive iteration matching the current send iteration. Tip: This is especially useful for vpp::remote buffers, because the buffer then remains on the device and no copying or syncing is needed.
void* VPP_ACC::transfer_buf(VPP_BP bp);
T* VPP_ACC::transfer_buf<T>(VPP_BP bp);
Arguments:
- VPP_BP bp: A buffer pool object returned by create_bufpool().
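A sketch of a two-accelerator composition; the classes acc1 and acc2, the pools midBP/outBP2, and SZ are assumptions:

acc2::send_while([=]() {
    int* out2 = acc2::alloc_buf<int>(outBP2, SZ);
    acc1::receive_one_in_order([=]() {
        int* mid = acc1::transfer_buf<int>(midBP);  // keep acc1's buffer alive for acc2's iteration
        acc2::compute(mid, out2);
    });
    return true;  // a real loop needs a termination condition
});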
- custom_sync_outputs()
-
This method can be called in the body of a send_while loop, before the call to the compute() function. It lets you provide a custom sync function to sync output buffers back to the host application. It is useful when only some (and not all) output buffer data needs to be transferred back from the hardware accelerator. Important: This disables any automatic syncing of all output buffers.
void custom_sync_outputs(std::function<void()> sync_outputs_fn);
Arguments:
- sync_outputs_fn: Specifies the custom sync function that will be called automatically for each iteration of the send_while loop when the compute tasks of the iteration have finished. When the sync_outputs_fn returns AND all requested sync_output() calls are complete, a receive will be triggered for the send_while loop iteration.
- sync_output()
-
This method is to be called inside the sync_outputs_fn passed to the custom_sync_outputs() method. It performs the requested sync in the background, returning a future which the caller can check for completion of the transfer.
std::future<void> sync_output(void* buf, size_t byte_sz, off_t byte_offset = 0);
std::future<void> sync_output<T>(T* buf, size_t numT, off_t Tindex = 0);
Arguments:
- buf: Buffer pointer obtained as a capture from the SendBody scope.
- byte_sz: Specifies the number of bytes for the requested buffer.
- byte_offset: Offset in the buffer to write to/read from.
- numT: Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
- Tindex: The buffer index to start writing to/reading from.
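A sketch combining custom_sync_outputs() and sync_output() to transfer back only part of an output buffer (my_acc and the pools assumed as before):

my_acc::send_while([=]() {
    int* in  = my_acc::alloc_buf<int>(inBP,  SZ);
    int* out = my_acc::alloc_buf<int>(outBP, SZ);
    // Must be called before compute(); disables all automatic output syncing.
    my_acc::custom_sync_outputs([=]() {
        auto fut = my_acc::sync_output<int>(out, 256);  // sync only the first 256 elements
        fut.wait();  // optional: block until the background transfer completes
    });
    my_acc::compute(in, out);
    return false;  // single iteration for brevity
});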
- sync_output_to_file()
-
This method is to be called inside the sync_outputs_fn passed to the custom_sync_outputs() method. It performs the requested sync in the background, returning a future which the caller can check for completion of the transfer.
std::future<void> sync_output_to_file(void* buf, int fd, size_t byte_sz, off_t fd_byte_offset = 0, off_t buf_byte_offset = 0);
std::future<void> sync_output_to_file<T>(T* buf, int fd, size_t numT, off_t fd_T_index = 0, off_t buf_T_index = 0);
Arguments:
- buf: Buffer pointer obtained as a capture from the SendBody scope.
- fd: The file descriptor to write to.
- byte_sz: Specifies the number of bytes for the requested buffer.
- fd_byte_offset: Offset in the file to read from/write to.
- buf_byte_offset: Offset in the buffer to write to/read from.
- numT: Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
- fd_T_index: The array index in the file to start reading from/writing to.
- buf_T_index: The buffer index to start writing to/reading from.
- set_handle()
-
Use inside the SendBody to identify any objects you want associated with the current iteration of the send_while loop.
void VPP_ACC::set_handle(intptr_t hndl);
void VPP_ACC::set_handle<T>(T hndl);
Arguments:
- hndl: Anything you might want to associate with the current iteration of the send_while loop. Tip: With the templatized form, any class which has a simple assignment/copy operator can be used as a handle.
- get_handle()
-
Used inside the RecvBody, this method returns the handle of an object that was set with set_handle() in the matching send iteration of the send_while loop.
intptr_t VPP_ACC::get_handle();
T VPP_ACC::get_handle<T>();
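Finally, a sketch tagging each send iteration with a job ID and recovering it on receive (assumptions as in the earlier sketches):

int job = 0;
my_acc::send_while([=, &job]() {
    int* in  = my_acc::alloc_buf<int>(inBP,  SZ);
    int* out = my_acc::alloc_buf<int>(outBP, SZ);
    my_acc::set_handle<int>(job);            // associate the ID with this iteration
    my_acc::compute(in, out);
    return ++job < NUM_JOBS;
});
my_acc::receive_all_in_order([=]() {
    int id   = my_acc::get_handle<int>();    // the ID set in the matching send iteration
    int* out = my_acc::get_buf<int>(outBP);
    // ... route out[] according to id ...
});
my_acc::join();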