This section describes specialized data transfers to and from the accelerator, such as transferring only parts of device (result) buffers, and different styles of file transfer that allow a single accelerator computation to process multiple files simultaneously.
Customized Transfer of I/O Sub-Buffers
Sometimes you do not want to sync the whole output buffer back to the host, and the sizes of what you want to sync back are not known up front in the host code. A good example is compression, especially when the compression algorithm compresses multiple chunks into the same output buffer and you want to sync back only the exact compressed chunks. As shown in the following example, this can be done by registering a function that controls exactly what is synced back once the data is available.
xfilter::custom_sync_outputs([=]()
{
    // sync back the per-chunk output sizes and wait for them to arrive
    auto fut = xfilter::sync_output<int>(outSz, chunks, 0);
    fut.get();
    // sync back only the exact compressed data of each chunk
    for (int chunk = 0; chunk < chunks; ++chunk) {
        xfilter::sync_output<int>(out, outSz[chunk], chunk * chunkSz);
    }
});
The custom_sync_outputs() method registers a callback function that determines which buffers or sub-buffers are synced back to the host. This method must be called inside the send_while body, before the first compute() call. Calling custom_sync_outputs() disables any automatic transfer-back of output buffers. Inside the callback function, the user-defined code has full control over what is synced back and whether synchronization is needed. The example above first syncs back the outSz buffer, which contains the sizes of the compressed chunks. The sync_output API returns a std::future, and calling fut.get() on that future makes the code wait for that sync to finish. The code then transfers all chunks, with their exact sizes, back to the host. These calls also return futures, but there is no need to wait on them: the System Compilation runtime layer makes sure that all those syncs have finished before starting the corresponding receive_all iteration.
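The following is a minimal placement sketch, assuming the xfilter accelerator class and the inBP, outBP, and outSzBP buffer pools from the surrounding examples (the moreWorkToDo() loop condition and the use of alloc_buf for the input buffer are illustrative): the callback is registered inside the send_while body, before the compute() call of the iteration.
xfilter::send_while([=]()
{
    // per-iteration buffers drawn from the pools created earlier
    int* in    = xfilter::alloc_buf<int>(inBP,    chunks * chunkSz);
    int* out   = xfilter::alloc_buf<int>(outBP,   chunks * chunkSz);
    int* outSz = xfilter::alloc_buf<int>(outSzBP, chunks);
    // ... fill "in" with this iteration's data ...

    // register the custom sync callback first ...
    xfilter::custom_sync_outputs([=]()
    {
        auto fut = xfilter::sync_output<int>(outSz, chunks, 0);
        fut.get();
        for (int chunk = 0; chunk < chunks; ++chunk)
            xfilter::sync_output<int>(out, outSz[chunk], chunk * chunkSz);
    });

    // ... then issue the compute call for this iteration
    xfilter::compute(chunks, chunkSz, in, out, outSz);

    return moreWorkToDo();
});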
File Transfer Modes
System Compilation mode also supports easy reading and writing of files, as shown in the file_filter_sc example in Supported Platforms and Startup Examples. This is enabled in the create_bufpool() method using either of the following modes, as discussed in VPP_ACC Class API:
- vpp::p2p mode: the file is transferred to the accelerator over the peer-to-peer PCIe bridge. This works only on platforms that support the P2P feature, for example the U2 card with a connected smartSSD.
- vpp::h2c mode: the file is transferred from the host CPU (connected file server) to the card over PCIe. This is the standard mode for most Alveo cards connected to a host CPU over PCIe.
This simple file-transfer switch (vpp::p2p versus vpp::h2c) makes an accelerator design portable across platforms. It is a simple matter to test a design for any platform using software emulation on a typical host CPU (connected to a file server) hosting the data files. However, the design must eventually be compiled for a specific platform supporting P2P, such as the U2 card connected to a smartSSD; the switch allows this port without changing the design sources.
The following example code demonstrates this:
auto inBP = my_acc::create_bufpool(vpp::input, P2P ? vpp::p2p : vpp::h2c);
my_acc::send_while([=]() {
    // P2P requires the file to be opened with O_DIRECT
    if (P2P) o_flags |= O_DIRECT;
    int fd = open(fnm, o_flags, s_flags);
    DT* in = (DT*)my_acc::file_buf(inBP, fd, fsz);
    my_acc::compute(in, ...);
    ...
Based on the value of the P2P flag, this code either does a peer-to-peer (P2P) transfer of a host-mapped NVMe device file, or a host-to-card (H2C) file transfer. In the P2P case, the file is loaded into device memory directly from the NVMe device, without any data transfer through the host. In the H2C case, the System Compilation runtime layer automatically transfers the file from the host to the device buffer (or vice versa for outputs).
To enable P2P, the file has to be opened with the O_DIRECT flag, as shown in the example above. In P2P mode, the host pointer returned by the call to VPP_ACC::file_buf is only a handle for the compute call argument, and cannot be read from or written to.
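For the output direction mentioned above ("or vice versa for outputs"), a hedged sketch, assuming an output buffer pool created with vpp::output and the same P2P/H2C switch (the out_fnm file name, flag handling, and more_files() loop condition are illustrative):
auto outBP = my_acc::create_bufpool(vpp::output, P2P ? vpp::p2p : vpp::h2c);
my_acc::send_while([=]() {
    // ... input handling as in the example above ...
    int flags = O_WRONLY | O_CREAT;
    if (P2P) flags |= O_DIRECT;           // O_DIRECT is also needed to enable P2P for the output file
    int ofd = open(out_fnm, flags, 0644);
    DT* out = (DT*)my_acc::file_buf(outBP, ofd, fsz);
    my_acc::compute(in, out, ...);        // the runtime writes the results to the file
    return more_files();                  // hypothetical loop condition
});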
Multi-File Buffers
As described in VPP_ACC Class API, you can make multiple calls to the file_buf method before calling the compute() method, to map multiple files, or multiple file segments, into a single device buffer. The code example below shows small portions of multiple files being processed simultaneously by the accelerator in one compute() call.
void* VPP_ACC::file_buf(VPP_BP bp, int fd, size_t sz, off_t fos = 0, off_t bos = 0)
xfilter::send_while([=, &total_out_size]()
{
    static int iter = 0;
    int* in;
    // collect all "chunks" input files into one "in" buffer
    for (int chunk = 0; chunk < chunks; ++chunk) {
        std::stringstream nm;
        nm << DATA << iter << '-' << chunk << ".orig";
        int ifd = open(nm.str().c_str(), rd_o_flags);
        assert(ifd > 2);
        in = xfilter::file_buf<int>(inBP, ifd, chunkSz, 0, chunk * chunkSz);
    }
    // prepare output buffer to be able to hold all chunks
    int* out = xfilter::file_buf<int>(outBP, 0, chunks * chunkSz, 0);
    // output buffer to provide the actual filtered size of each chunk
    int* outSz = xfilter::alloc_buf<int>(outSzBP, chunks);
    ...
    xfilter::compute(chunks, chunkSz, in, out, outSz);
    ...
});
The code creates an input (in) buffer that holds these file segments, and an output (out) buffer that holds the processed output data. In every send_while iteration, file_buf() provides the size (chunkSz) and read offset (chunk * chunkSz) of each file associated with the in buffer. In the subsequent call to compute(), all those file segments are written to the input device buffer, or read from the output device buffer.
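For comparison, here is a minimal sketch of the other use mentioned above, mapping two whole files back to back into a single device buffer (the fdA/fdB descriptors and szA/szB sizes are illustrative, expressed in the same units as in the example above):
// map file A at the start of the device buffer ...
int* in = xfilter::file_buf<int>(inBP, fdA, szA, 0, 0);
// ... and file B right after it, using a buffer offset of szA
in      = xfilter::file_buf<int>(inBP, fdB, szB, 0, szA);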
in = xfilter::file_buf<...> is repeated inside the for-loop so that the last returned pointer is assigned to in. It is important to follow this pattern when writing the application code.
Custom Transfer of Output Files
You can also use custom sync for file buffers, in which case an output buffer can be synced to a file descriptor inside custom_sync_outputs(). As explained in VPP_ACC Class API, you can do this by calling the sync_output_to_file() method, as shown in the following example.
VPP_ACC::sync_output_to_file(void* buf, int fd, size_t byte_sz,
                             off_t fd_byte_offset = 0,
                             off_t buf_byte_offset = 0);
In this case, files added by a call to VPP_ACC::file_buf(bufPool, fd, sz) will not be synced automatically. Therefore, a call to the file_buf API is not needed to actually add a file; it is required only to return a host pointer. Adding a dummy file like this is the best approach: VPP_ACC::file_buf(bufPool, 0, 0);
Here's an example code snippet:
my_acc::send_while([=]() {
    DT* out = (DT*)my_acc::file_buf(outBP, 0, 0);
    ...
    my_acc::custom_sync_outputs([=]() {
        ...
        auto fut = my_acc::sync_output_to_file(out, fd, sz, fd_offset, buf_offset);
        ...
    });
    ...
});
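As a concrete illustration of how these pieces can fit together, the following hedged sketch reuses the chunks, chunkSz, and outSz names from the compression example earlier in this section; the ofd[] per-chunk file descriptors are assumed to be opened by the application, and outSz[] and chunkSz are treated as byte counts here:
my_acc::custom_sync_outputs([=]()
{
    // bring back the per-chunk sizes first and wait for them
    auto szFut = my_acc::sync_output<int>(outSz, chunks, 0);
    szFut.get();

    // write each chunk from the device buffer directly into its own file
    for (int chunk = 0; chunk < chunks; ++chunk) {
        my_acc::sync_output_to_file(out,              // output buffer handle
                                    ofd[chunk],       // destination file descriptor
                                    outSz[chunk],     // number of bytes to write
                                    0,                // offset within the file
                                    chunk * chunkSz); // byte offset within the buffer
    }
});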