Special Data Transfer Models - 2024.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2024-07-03
Version: 2024.1 English

This section describes specialized data transfers to and from the accelerator, such as syncing back only parts of device (result) buffers, and different styles of file transfer that allow a single accelerator computation to process multiple files simultaneously.

Customized Transfer of I/O Sub-Buffers

Sometimes you do not want to sync the whole output buffer back to the host, and the sizes of what you want to sync back are not known up front in the host code. A good example is compression, especially when the algorithm compresses multiple chunks into the same output buffer and you want to sync back only the exact compressed chunks from that buffer. As shown in the following example, this can be done by registering a function that controls exactly what is synced back once the data is available.

xfilter::custom_sync_outputs([=]()
{
    // sync back the per-chunk compressed sizes first, and wait for them
    auto fut = xfilter::sync_output<int>(outSz, chunks, 0);
    fut.get();
    // then sync back only the exact compressed data of each chunk
    for (int chunk = 0; chunk < chunks; ++chunk) {
        xfilter::sync_output<int>(out, outSz[chunk], chunk * chunkSz);
    }
});

The custom_sync_outputs() method registers a callback function that determines which buffers or sub-buffers are synced back to the host. This method must be called inside the send_while body, before the first compute() call.

Important: Calling custom_sync_outputs() disables any automatic transfer-back of output buffers.

Inside the callback function the user-defined code has full control over what is synced back and whether synchronization is needed. The example above first syncs back the outSz buffer, which contains the sizes of the compressed chunks. The sync_output API returns a std::future; calling fut.get() on that future makes the code wait for the sync to finish. The code then transfers each chunk, with its exact size, back to the host. These calls also return futures, but there is no need to wait on them: the System Compilation runtime layer ensures that all those syncs have finished before starting the corresponding receive_all iteration.
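For orientation, the following sketch shows where this registration sits inside the send_while body, before the compute() call. It is a minimal sketch only; the buffer pools (inBP, outBP, outSzBP), the chunk geometry (chunks, chunkSz), and the iterations count are illustrative assumptions, not prescribed by the API.

xfilter::send_while([=]()
{
    // host-side buffers for this iteration (pools assumed to exist)
    int* in    = xfilter::alloc_buf<int>(inBP,    chunks * chunkSz);
    int* out   = xfilter::alloc_buf<int>(outBP,   chunks * chunkSz);
    int* outSz = xfilter::alloc_buf<int>(outSzBP, chunks);

    // ... fill "in" with this iteration's data ...

    // register the custom sync before the first compute() call;
    // this disables automatic transfer-back of output buffers
    xfilter::custom_sync_outputs([=]()
    {
        auto fut = xfilter::sync_output<int>(outSz, chunks, 0);
        fut.get();  // wait until the chunk sizes are available on the host
        for (int chunk = 0; chunk < chunks; ++chunk)
            xfilter::sync_output<int>(out, outSz[chunk], chunk * chunkSz);
    });

    xfilter::compute(chunks, chunkSz, in, out, outSz);

    static int iter = 0;
    return ++iter < iterations;   // iterations: assumed host-side loop bound
});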

File Transfer Modes

System Compilation mode also supports easy reading from and writing to files, as shown in the file_filter_sc example in Supported Platforms and Startup Examples. This can be enabled in the create_bufpool() method using either of the following modes, as discussed in VPP_ACC Class API:

  • vpp::p2p mode: the file is transferred over the peer-to-peer PCIe bridge to the accelerator. This works only on platforms that support the P2P feature, for example the U2 card with a connected SmartSSD.
  • vpp::h2c mode: the file is transferred from the host CPU (connected file server) to the card over PCIe. This is standard for most Alveo cards connected to a host CPU over PCIe.

This simple file-transfer switch (vpp::p2p versus vpp::h2c) makes an accelerator design portable across platforms. It is a simple matter to test a design for any platform using software emulation on a typical host CPU (connected to a file server) hosting the data files. Eventually, however, the design must be compiled for a specific platform supporting P2P, such as the U2 card connected to a SmartSSD, allowing direct porting without changing the design sources.

Tip: Running hardware emulation or running on hardware in vpp::p2p mode will not work on an Alveo platform that does not support P2P.

The following example code demonstrates this:

   auto inBP = my_acc::create_bufpool(vpp::input, P2P ? vpp::p2p : vpp::h2c);
   my_acc::send_while([=]() {
       if (P2P) o_flags |= O_DIRECT;   // P2P requires O_DIRECT
       int fd = open(fnm, o_flags, s_flags);
       DT* in = (DT*)my_acc::file_buf(inBP, fd, fsz);
       my_acc::compute(in, ...);
       ...

Based on the value of the P2P flag, this code either does a peer-to-peer (P2P) transfer of a host-mapped NVMe device file, or a host-to-card (H2C) file transfer. In the P2P case, the file is loaded into the device memory directly from the NVMe device, without any data transfer through the host. In the H2C case, the System Compilation runtime layer automatically transfers the file from the host to the device buffer (or vice versa for outputs).

To enable P2P, the file has to be opened with the O_DIRECT flag, as shown in the example above. When in P2P mode, the host pointer returned by the call to VPP_ACC::file_buf is only a handle for the compute() call argument, and cannot be read from or written to.

Multi-File Buffers

As described in VPP_ACC Class API, you can make multiple calls to the file_buf method before calling the compute() method, to map multiple files, or multiple file segments, into a single device buffer. The code example below shows small portions of multiple files being processed simultaneously by the accelerator in one compute() call.

void* VPP_ACC::file_buf(VPP_BP bp, int fd, size_t sz, off_t fos = 0, off_t bos = 0)

xfilter::send_while([=, &total_out_size]()
{
    static int iter = 0;
    int* in;
    // collect all "chunks" input files into one "in" buffer
    for (int chunk = 0; chunk < chunks; ++chunk) {
        std::stringstream nm;
        nm << DATA << iter << '-' << chunk << ".orig";
        int ifd = open(nm.str().c_str(), rd_o_flags);
        assert(ifd > 2);
        in = xfilter::file_buf<int>(inBP, ifd, chunkSz, 0, chunk * chunkSz);
    }
    // prepare output buffer to be able to hold all chunks
    int* out = xfilter::file_buf<int>(outBP, 0, chunks * chunkSz, 0);
    // output buffer to provide the actual filtered size of each chunk
    int* outSz = xfilter::alloc_buf<int>(outSzBP, chunks);
....
    xfilter::compute(chunks, chunkSz, in, out, outSz);
....
});

The code creates an input (in) buffer that holds these file segments, and an output (out) buffer that holds the processed output data. In every send_while iteration, file_buf() provides the segment size (chunkSz) and read offset (chunk * chunkSz) for each file associated with the in buffer. In the subsequent call to compute(), all those file segments are written to the input device buffer, or read from the output device buffer.

Note: The assignment in = xfilter::file_buf<...> is repeated in the for loop so that the last returned pointer is assigned to in. This pattern must be followed when writing the application code.

Custom Transfer of Output Files

You can also use a custom sync for file buffers, where an output buffer is synced to a file descriptor inside custom_sync_outputs(). As explained in VPP_ACC Class API, you do this by calling the sync_output_to_file() method, as shown in the following example.

VPP_ACC::sync_output_to_file(void* buf, int fd, size_t byte_sz,
                             off_t fd_byte_offset = 0,
                             off_t buf_byte_offset = 0);

In this case, files added by a call to VPP_ACC::file_buf(bufPool, fd, sz) are not synced automatically. A call to the file_buf API is therefore not needed to actually add a file; it is required only to return a host pointer. The best approach is to add a dummy file, as in: VPP_ACC::file_buf(bufPool, 0, 0);

Here's an example code snippet:

    my_acc::send_while([=]() {
       DT* out = (DT*)my_acc::file_buf(outBP, 0, 0);
       ...
       my_acc::custom_sync_outputs([=](){
           ...
           auto fut = my_acc::sync_output_to_file(out, fd, sz, fd_offset, buf_offset);
           ...
       });
       ...
    });
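Putting these pieces together, the following is a minimal sketch of the complete pattern. The chunk geometry (chunks, chunkSz), the output file descriptor fd, the size buffer outSz, and the iterations count are illustrative assumptions based on the compression example above, not prescribed by the API.

    my_acc::send_while([=]() {
        // dummy file (fd = 0): only used to obtain a host pointer for the output buffer
        DT*  out   = (DT*)my_acc::file_buf(outBP, 0, 0);
        int* outSz = my_acc::alloc_buf<int>(outSzBP, chunks);

        my_acc::custom_sync_outputs([=]() {
            // bring back the per-chunk compressed sizes first
            auto fut = my_acc::sync_output<int>(outSz, chunks, 0);
            fut.get();
            // then write each compressed chunk from the device buffer straight into the file
            off_t fd_offset = 0;
            for (int chunk = 0; chunk < chunks; ++chunk) {
                size_t sz = outSz[chunk] * sizeof(DT);           // assumes sizes are in elements of DT
                my_acc::sync_output_to_file(out, fd, sz,
                                            fd_offset,                       // file byte offset
                                            chunk * chunkSz * sizeof(DT));   // buffer byte offset
                fd_offset += sz;
            }
        });

        my_acc::compute(chunks, chunkSz, /* in, */ out, outSz);

        static int iter = 0;
        return ++iter < iterations;   // iterations: assumed host-side loop bound
    });

As in the earlier example, the only explicit synchronization point is the future returned for the outSz buffer; the runtime ensures that the per-chunk file writes have finished before the corresponding receive_all iteration starts.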