VPP_ACC Class API - 2023.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393

Release Date: 2023-12-13

Version: 2023.2 English

You can define a hardware accelerator derived from the VPP_ACC base class, and build the hardware and software interface with VSC. This section describes the software API provided by the VPP_ACC class.

Controlling the Accelerator

The VPP_ACC class API provides methods for scheduling jobs on the accelerator hardware, processing results, and other software controls at run time.

send_while()

This method executes the SendBody repeatedly until it returns false. In pseudo-code, send_while() behaves like: bool ret; do { ret = f(); } while (ret);

void send_while(SendBody f, VPP_CC& cc = *s_default_cc);

Table 1. send_while() arguments
Argument	Description
SendBody f	SendBody is a user-defined C++ lambda function that captures variables from the enclosing scope. All buffer pool variables used in the body must be captured, and they should be captured by value. Using `[=]` captures them by value automatically, along with any other variable used inside the lambda function. Any variable that is modified inside the lambda function must be captured by reference, for example `[=, &m]`. Capturing variables by reference unnecessarily can degrade host code performance. On the other hand, capturing a large class object by value might cause an unnecessary deep copy, although such objects are unlikely to be needed in the send (or receive) functions.
VPP_CC& cc	Optional argument used for grouping CUs. For example it can be used to specify which multi-card cluster of CUs to use as described in CU Cluster and Multi-Card Support.
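
The loop and capture rules above can be tried out on the host without any accelerator. The following is a minimal sketch, not the VSC runtime: `send_while_model` and `count_send_iterations` are hypothetical names introduced only for illustration, and the body merely records a value where real code would call alloc_buf() and compute().

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for send_while(): calls the body until it returns
// false. No hardware, threads, or VPP_CC are involved.
template <typename SendBody>
void send_while_model(SendBody f) {
    bool again;
    do {
        again = f();
    } while (again);
}

// Runs a send loop for n jobs and returns how many times the body executed.
inline int count_send_iterations(int n) {
    std::vector<int> sent;  // modified in the body, so captured by reference
    int m = 0;              // loop counter, also modified in the body
    const int scale = 10;   // read-only, captured by value through [=]
    send_while_model([=, &m, &sent]() -> bool {
        sent.push_back(m * scale);  // stand-in for alloc_buf() + compute()
        return ++m < n;             // returning false ends the loop
    });
    return static_cast<int>(sent.size());
}
```

Because the loop has do-while semantics, the body runs once even when n is 0, so a SendBody has to be safe to execute at least one time.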

compute()

As described in The compute() API, the compute() method is a special user-defined method in the derived VPP_ACC accelerator class definition which is used to represent the CU and contain the processing elements (PEs).

void compute(Args ...);

- A compute() call is a call to the hardware accelerator that schedules one job.
- One or more compute() calls can be made inside the SendBody function.
- Each compute() call is non-blocking and returns immediately, unless the task pipeline is full, in which case it blocks.
- In the background, a compute() call ensures that all of its inputs are transferred to the device and that the job is then executed on any available CU.
- Once all compute() calls of an iteration have finished, the output buffers are transferred back to the host and a receive_all iteration is started for that iteration.

The application code must satisfy the following conditions, which are asserted during software emulation:
- Once a compute() call has been made, input buffers and file buffers can no longer be modified, and no more calls to alloc_buf() or file_buf() can be made in that iteration.
- Output buffers cannot be read or written until the data is received in the corresponding receive iteration.
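
The first condition is an ordering rule within a single send iteration: allocate and fill buffers first, then call compute(). A minimal sketch of that rule follows, assuming a hypothetical `IterationRules` helper that is not part of VSC and only tracks whether compute() has already been called.

```cpp
#include <cassert>

// Hypothetical bookkeeping for one send_while iteration; not part of VSC.
class IterationRules {
public:
    // Start of a new send iteration: buffers may be allocated again.
    void begin_iteration() { compute_called_ = false; }

    // Record that compute() has been called in this iteration.
    void on_compute() { compute_called_ = true; }

    // alloc_buf(), file_buf(), and writes to input buffers are only
    // allowed before the first compute() call of the iteration.
    bool may_allocate_or_write_inputs() const { return !compute_called_; }

private:
    bool compute_called_ = false;
};
```

In host code this translates to a fixed order inside the SendBody: all alloc_buf() and file_buf() calls, then all writes to input buffers, then the compute() calls.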

receive_all_xxx()

Executes a C++ lambda function repeatedly, either in order or as soon as possible (ASAP), each time a compute request completes, to receive results from the hardware accelerator. Exits when send_while() has exited and all iterations have been received.

void receive_all_in_order(RecvBody f, VPP_CC& cc = *s_default_cc);

void receive_all_asap(RecvBody f, VPP_CC& cc = *s_default_cc);

Table 2. receive_all_xxx() arguments
Argument	Description
RecvBody f	RecvBody is a user-defined C++ lambda function. Refer to the explanation of lambda functions in `send_while()`.
VPP_CC& cc	Optional argument used for grouping CUs. For example it can be used to specify which multi-card cluster of CUs to use as described in CU Cluster and Multi-Card Support.
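
The difference between the two variants is the order in which iterations reach the RecvBody. The sketch below models that ordering on the host only. It assumes that "in order" means the order of the send iterations and that "ASAP" means completion order; `deliver_asap` and `deliver_in_order` are hypothetical helper names, not VSC functions.

```cpp
#include <cassert>
#include <set>
#include <vector>

// completion[i] is the send-iteration index of the i-th job to finish.
// ASAP model: the receive body sees iterations in completion order.
inline std::vector<int> deliver_asap(const std::vector<int>& completion) {
    return completion;
}

// In-order model: an iteration is delivered only after all earlier ones.
inline std::vector<int> deliver_in_order(const std::vector<int>& completion) {
    std::vector<int> delivered;
    std::set<int> finished;  // completed but not yet delivered
    int next = 0;            // next iteration index the receiver expects
    for (int it : completion) {
        finished.insert(it);
        while (finished.count(next)) {  // drain everything that is now ready
            finished.erase(next);
            delivered.push_back(next++);
        }
    }
    return delivered;
}
```

With a completion order of 2, 0, 1, 3, the ASAP model delivers 2, 0, 1, 3, while the in-order model holds iteration 2 back until 0 and 1 have arrived and delivers 0, 1, 2, 3.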

receive_one_xxx()

As described in Multi-Accelerator Pipeline Composition, this is used to receive one iteration of this accelerator inside the send_while() loop of another accelerator.

void receive_one_in_order(RecvBody f, VPP_CC& cc = *s_default_cc);

void receive_one_asap(RecvBody f, VPP_CC& cc = *s_default_cc);

Table 3. receive_one_xxx() arguments
Argument	Description
RecvBody f	RecvBody is a user-defined C++ lambda function. Refer to the explanation of lambda functions in `send_while()`.
VPP_CC& cc	Optional argument used for grouping CUs. For example it can be used to specify which multi-card cluster of CUs to use as described in CU Cluster and Multi-Card Support.

join()

Wait for send and receive loops to finish.

void join(VPP_CC& cc = *s_default_cc);

Table 4. join() arguments
Argument	Description
VPP_CC& cc	Optional argument used for grouping CUs. For example it can be used to specify which multi-card cluster of CUs to use as described in CU Cluster and Multi-Card Support.

set_ncu()

Set the number of CUs the driver should use. Use this method before starting the send/receive loop to establish the number of CUs the compute() function should use.

void VPP_ACC::set_ncu(int ncu);

Table 5. set_ncu() arguments
Argument	Description
int ncu	The number of CUs specified (ncu) should be (1 <= ncu <= NCU) where NCU is the template parameter of `VPP_ACC< .... , NCU>`, as described in User-Defined Accelerator Class.

get_ncu()

Returns the number of CUs currently used by the driver, as previously set by VPP_ACC::set_ncu(), or, if not modified, the NCU template parameter of VPP_ACC< .... , NCU>.

int VPP_ACC::get_ncu();

get_NCU()

Returns the number of CUs implemented in hardware, that is, the value of the NCU template parameter provided when building the hardware and specified in the base class VPP_ACC< .... , NCU>.

int VPP_ACC::get_NCU();
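
The relationship between set_ncu(), get_ncu(), and get_NCU() can be summarized with a small host-side model. This is a sketch under assumptions: `CuCountModel` is a hypothetical class, and throwing `std::out_of_range` for an invalid value is an illustrative choice, since the behavior of the real driver for out-of-range values is not described here.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical model of the CU-count bookkeeping. NCU mirrors the template
// parameter of VPP_ACC<..., NCU>. This is not the real driver.
template <int NCU>
class CuCountModel {
public:
    // Mirrors set_ncu(): accept only 1 <= ncu <= NCU.
    // Throwing on a bad value is an assumption made for this sketch.
    void set_ncu(int ncu) {
        if (ncu < 1 || ncu > NCU) {
            throw std::out_of_range("ncu must satisfy 1 <= ncu <= NCU");
        }
        ncu_ = ncu;
    }

    // CUs currently used by the driver.
    int get_ncu() const { return ncu_; }

    // CUs implemented in hardware.
    static constexpr int get_NCU() { return NCU; }

private:
    int ncu_ = NCU;  // default when set_ncu() was never called
};
```

A typical use is to build hardware with the maximum CU count and then lower the count at run time with set_ncu() to compare throughput, without rebuilding.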

Setup I/O for Computation

The API methods described here are used to set up input and output buffers for the hardware accelerator.

create_bufpool()

Creates and returns an opaque class object to be used in other methods that require a buffer handle, such as alloc_buf(). Use before starting the send/receive loops.

VPP_BP VPP_ACC::create_bufpool(vpp::Mode m, vpp::FileXFer = vpp::none);

Table 6. create_bufpool() arguments
Argument	Description
vpp::Mode m	Specifies the data transfer type for each `compute()` argument. One of: `vpp::input`: data is transferred into the accelerator; `vpp::output`: data is transferred out of the accelerator; `vpp::bidirectional`: data is transferred into and out of the accelerator; `vpp::remote`: data is resident only in the device memory connected to the accelerator, and is not sent or received by the host code.
vpp::FileXFer = vpp::none	Indicates how a file is transferred. One of: `vpp::p2p`: the file is transferred over the P2P bridge to the accelerator. This works only on platforms that support the P2P feature, for example the U2 card with a connected smartSSD; `vpp::h2c`: the file is transferred from a host CPU (connected file server) to the card over PCIe. This is standard for most Alveo cards connected to a host CPU over PCIe; `vpp::none`: uses regular buffer objects, with no file transfer support. This is the default value.

alloc_buf()

Returns a pointer to the buffer object. Use inside the send thread's lambda function.

void* VPP_ACC::alloc_buf(VPP_BP bp, int byte_sz);

T* VPP_ACC::alloc_buf<T>(VPP_BP bp, int numT);

Important: The buffer gets allocated from the given buffer pool. The lifetime of the buffer is until the end of the matching receive iteration. At that point the buffer will automatically be returned to the buffer pool.

Table 7. alloc_buf() arguments
Argument	Description
VPP_BP bp	A buffer pool object returned by `create_bufpool()`
int byte_sz	Specifies the number of bytes for the requested buffer
int numT	Specifies the number of elements for the requested <T> array buffer

file_buf()

This method maps a given file, or part of a file, to a buffer from the specified buffer pool object. Use inside the send thread's lambda function. The file_buf() method can be called multiple times to map multiple files (or file segments) to different locations in a single buffer.

The method returns a pointer to the buffer object. This pointer is only a host-side handle: host code cannot use it to read or write the buffer contents.

void* VPP_ACC::file_buf(VPP_BP bp, int fd, int byte_sz, off_t fd_byte_offset=0, off_t buf_byte_offset=0);

T* VPP_ACC::file_buf<T>(VPP_BP bp, int fd, int numT, off_t fd_T_index=0, off_t buf_T_index=0);

Table 8. file_buf() arguments
Argument	Description
VPP_BP bp	A buffer pool object returned by `create_bufpool()`.
int fd	The file descriptor to read from or write to (or 0 when using `custom_sync_outputs()` as described below). In P2P mode the file is opened with O_DIRECT flag.
int byte_sz	Specifies the number of bytes for the requested buffer. In P2P mode, this must align to the file system block size (4 kB).
off_t fd_byte_offset	Byte offset in the file to read from/write to.
off_t buf_byte_offset	Byte offset in the buffer to write to/read from.
int numT	Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
off_t fd_T_index	The array index in the file to start reading from/writing to.
off_t buf_T_index	The buffer index to start writing to/reading from.

Additional notes:

- The statement T* buf = file_buf<T>(bp, fd, num, fd_idx, buf_idx); is the same as T* buf = (T*)file_buf(bp, fd, num*sizeof(T), fd_idx*sizeof(T), buf_idx*sizeof(T));
- The actual size of the buffer is adjusted as required. As a result, the buffer returned by the last call must be the one used in the compute() call(s). Once a buffer has been used in a compute() call, no more mappings can be added in that iteration.
- See file_filter_sc under Startup Example in Supported Platforms and Startup Examples.
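
The equivalence between the templated and byte-based forms is plain index-to-byte arithmetic, which can be checked in isolation. The sketch below uses hypothetical helpers (`FileBufByteArgs`, `to_byte_args`, `is_p2p_aligned`) and `std::int64_t` as a portable stand-in for `off_t`. It does not call file_buf() itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte-based arguments corresponding to the non-template file_buf() form.
struct FileBufByteArgs {
    std::size_t byte_sz;
    std::int64_t fd_byte_offset;   // stand-in for off_t
    std::int64_t buf_byte_offset;  // stand-in for off_t
};

// Converts an element count and element indices of type T into a byte size
// and byte offsets, following num*sizeof(T), fd_idx*sizeof(T), and
// buf_idx*sizeof(T).
template <typename T>
FileBufByteArgs to_byte_args(std::size_t numT, std::int64_t fd_T_index = 0,
                             std::int64_t buf_T_index = 0) {
    const std::int64_t elem = static_cast<std::int64_t>(sizeof(T));
    return {numT * sizeof(T), fd_T_index * elem, buf_T_index * elem};
}

// P2P mode: the byte size must be a multiple of the 4 kB file system block.
inline bool is_p2p_aligned(std::size_t byte_sz) {
    return byte_sz % 4096 == 0;
}
```

For example, 1024 elements of a 4-byte type starting at file index 2 and buffer index 3 correspond to 4096 bytes at file offset 8 and buffer offset 12, and 4096 bytes satisfies the P2P block alignment.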

get_buf()

Use inside the receive loop associated with a matching send loop. This returns the buffer object which was allocated in the matching send iteration.

void* VPP_ACC::get_buf(VPP_BP bp);

T*    VPP_ACC::get_buf<T>(VPP_BP bp);

Table 9. get_buf() arguments
Argument	Description
VPP_BP bp	A buffer pool object returned by `create_bufpool()`

transfer_buf()

This method is to be used in multi-accelerator composition, as described in Multi-Accelerator Pipeline Composition, to transfer ownership of a buffer from one accelerator using a receive_one_xxx() method inside the send_while of another accelerator, to that other accelerator. This extends the lifetime of the buffer till the end of the receive iteration matching the current send iteration.

Tip: This is especially useful for vpp::remote buffers, because then the buffer will remain on the device and no copying or syncing will be needed.

void* VPP_ACC::transfer_buf(VPP_BP bp);

T* VPP_ACC::transfer_buf<T>(VPP_BP bp);

Table 10. transfer_buf() arguments
Argument	Description
VPP_BP bp	A buffer pool object returned by `create_bufpool()`

custom_sync_outputs()

This method can be called in the body of a send_while loop, before the call to the compute() function. It lets you provide a custom sync function that syncs output buffers back to the host application. It is useful when only some, but not all, of the output buffer data needs to be transferred back from the hardware accelerator.

Important: This will disable any automatic syncing of all output buffers.

void custom_sync_outputs(std::function<void()> sync_outputs_fn);

Table 11. custom_sync_outputs() arguments
Argument	Description
sync_outputs_fn	Specifies the custom sync function that will be called automatically for each iteration of the `send_while` loop when the compute tasks of the iteration have finished. When the `sync_outputs_fn` returns AND all requested `sync_output()` calls are complete, a receive will be triggered for the `send_while` loop iteration.

sync_output()

This method is to be called inside the sync_outputs_fn passed to the custom_sync_outputs() method. It performs the requested sync in the background and returns a future that the caller can check for completion of the transfer.

std::future<void> sync_output(void* buf, size_t byte_sz, off_t byte_offset = 0);

std::future<void> sync_output<T>(T* buf, size_t numT, off_t Tindex = 0);

Table 12. sync_output() arguments
Argument	Description
buf	Buffer pointer obtained as a capture from the `SendBody` scope.
byte_sz	Specifies the number of bytes for the requested buffer.
byte_offset	Offset in the buffer to write to/read from.
numT	Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
Tindex	The buffer index to start writing to/reading from.
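
The future-based pattern can be illustrated with standard C++ alone. The following is a minimal sketch, not the VSC implementation: `sync_output_model` and `sync_first_half` are hypothetical names, the "device" is an ordinary host vector, and `std::launch::deferred` keeps the sketch single-threaded, so the copy runs when the future is waited on rather than in the background.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical stand-in for sync_output(): copies byte_sz bytes starting at
// byte_offset from a simulated device buffer into the host buffer.
inline std::future<void> sync_output_model(const std::vector<char>& device,
                                           std::vector<char>& host,
                                           std::size_t byte_sz,
                                           std::size_t byte_offset = 0) {
    return std::async(std::launch::deferred,
                      [&device, &host, byte_sz, byte_offset] {
        assert(byte_offset + byte_sz <= device.size());
        assert(byte_offset + byte_sz <= host.size());
        std::memcpy(host.data() + byte_offset,
                    device.data() + byte_offset, byte_sz);
    });
}

// Syncs only the first half of an 8-byte output buffer, then waits for it.
inline std::vector<char> sync_first_half() {
    const std::vector<char> device(8, 'd');  // results produced on the device
    std::vector<char> host(8, '.');          // host copy, initially stale
    std::vector<std::future<void>> pending;
    pending.push_back(sync_output_model(device, host, 4));
    for (auto& f : pending) {
        f.wait();  // every requested sync must be complete
    }
    return host;
}
```

After the call, only the first four bytes of the host buffer hold device data. This is the effect a custom sync function aims for when part of an output buffer is not needed on the host.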

sync_output_to_file()

std::future<void> sync_output_to_file(void* buf, int fd, size_t byte_sz, off_t fd_byte_offset = 0, off_t buf_byte_offset = 0);

std::future<void> sync_output_to_file<T>(T* buf, int fd, size_t numT, off_t fd_T_index = 0, off_t buf_T_index = 0);

Table 13. sync_output_to_file() arguments
Argument	Description
buf	Buffer pointer obtained as a capture from the `SendBody` scope.
fd	The file descriptor to write to.
byte_sz	Specifies the number of bytes for the requested buffer.
fd_byte_offset	Byte offset in the file to read from/write to.
buf_byte_offset	Byte offset in the buffer to write to/read from.
numT	Specifies the number of elements for the requested <T> array buffer. In P2P mode, this must align to the file system block size (4 kB).
fd_T_index	The array index in the file to start reading from/writing to.
buf_T_index	The buffer index to start writing to/reading from.

set_handle()

Use inside the SendBody to identify any objects you want associated with the current iteration of the send_while loop.

void VPP_ACC::set_handle(intptr_t hndl);

void VPP_ACC::set_handle<T>(T hndl);

Table 14. set_handle() arguments
Argument	Description
hndl	Anything you might want to associate with the current iteration of the `send_while` loop. Tip: With the templatized form any class which has a simple assignment/copy operator can be used as a handle.

get_handle()

Used inside the RecvBody, this method returns the handle of an object that was set (set_handle()) in the matching send iteration of the send_while loop.

intptr_t VPP_ACC::get_handle();

T VPP_ACC::get_handle<T>();