The first step is to understand the flow of data transfer using the non-p2p version of the host code, host.cpp. A simple increment kernel runs on the source device; it takes an input buffer and generates an output buffer:
out1 = in1 + scalar
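As a point of reference, the kernel's behavior can be modeled on the host. The following is a minimal sketch, assuming the increment is applied element-wise to integer data; the function name increment_reference is hypothetical and does not appear in host.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model of the increment kernel (hypothetical helper,
// not part of host.cpp). Assumption: the kernel adds the scalar to every
// element of the input buffer.
std::vector<int> increment_reference(const std::vector<int>& in1, int scalar) {
    std::vector<int> out1(in1.size());
    for (std::size_t i = 0; i < in1.size(); ++i) {
        out1[i] = in1[i] + scalar;
    }
    return out1;
}
```

A model like this gives the host an expected result to compare against once the data has reached the destination device.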
The input buffer of the destination device is in2. Data has to be transferred from out1 to in2. For that purpose, the mapped pointer of out1, out1_map, is used to create the buffer in2, so both buffer objects share the same host memory.
// Output buffer of device1
auto out1 = xrt::bo(device1, vector_size_bytes, xrt::bo::flags::normal, krnl.group_id(1));
// Mapped pointer to the host side of out1
auto out1_map = out1.map<int*>();
// Input buffer of device2, created using the mapped pointer of device1's output buffer
auto in2 = xrt::bo(device2, out1_map, vector_size_bytes, 0);
After the kernel on the source device finishes its execution, the out1 buffer is updated inside the source device's global memory. To transfer the data into the destination device's global memory, the following two DMA transfers are required, in this order:
DMA from source device global memory to host memory, by performing a device-to-host sync on out1
DMA from host memory to destination device global memory, by performing a host-to-device sync on in2
out1.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
in2.sync(XCL_BO_SYNC_BO_TO_DEVICE);
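The two-hop data flow can be illustrated without any hardware. The sketch below is a toy model only: ToyBuffer and two_hop_transfer are hypothetical names, std::vector stands in for device global memory, and nothing here is part of the XRT API. It shows why the two syncs suffice when both buffers share one host-side allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer object: private "device" memory plus a pointer
// to host backing storage. This is NOT xrt::bo, only a model of the data flow.
struct ToyBuffer {
    std::vector<int> device_mem;  // models device global memory
    int* host_ptr;                // models the host-side backing storage
    std::size_t count;

    ToyBuffer(int* host, std::size_t n) : device_mem(n, 0), host_ptr(host), count(n) {
        assert(host != nullptr || n == 0);
    }

    // Models a device-to-host sync: device memory is copied to host memory.
    void sync_from_device() {
        std::copy(device_mem.begin(), device_mem.end(), host_ptr);
    }

    // Models a host-to-device sync: host memory is copied to device memory.
    void sync_to_device() {
        std::copy(host_ptr, host_ptr + count, device_mem.begin());
    }
};

// Runs the two-hop transfer and returns what lands in the destination device.
std::vector<int> two_hop_transfer(const std::vector<int>& kernel_output) {
    std::vector<int> host_backing(kernel_output.size(), 0);
    ToyBuffer out1(host_backing.data(), host_backing.size());
    ToyBuffer in2(host_backing.data(), host_backing.size());  // shares out1's host storage
    out1.device_mem = kernel_output;  // the kernel has written its result on device 1
    out1.sync_from_device();          // hop 1: device 1 to host
    in2.sync_to_device();             // hop 2: host to device 2
    return in2.device_mem;
}
```

Because in2 is constructed over the same host pointer as out1, no explicit host-side copy is needed between the two hops.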
Note
This test case measures the throughput of the above two sync operations by executing them a number of times. A similar throughput calculation for the p2p version of the design appears later in this tutorial. The throughput number can vary depending on the buffer size. Because the purpose of this tutorial is not to demonstrate p2p performance, a small buffer is used. The throughput number can also vary widely depending on several hardware aspects, such as:
The PCIe slots used for the two cards, and whether the cards are under the same switch
The CPU architecture of the server, specifically how PCIe transactions are routed between root ports on the CPU buses
Whether DMA Read or DMA Write is used to transfer the p2p buffer content
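The throughput measurement described in this note can be sketched as follows. This is an illustration under stated assumptions rather than the test case's actual code: the helper names throughput_mb_per_s and time_iterations are hypothetical, 1 MB is taken as 10^6 bytes, and the buffer size is counted once per iteration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: throughput in MB/s (1 MB = 1e6 bytes) for a transfer
// of bytes_per_iteration bytes repeated `iterations` times.
double throughput_mb_per_s(std::size_t bytes_per_iteration, int iterations,
                           double elapsed_seconds) {
    if (elapsed_seconds <= 0.0) {
        return 0.0;  // avoid dividing by zero when the timer resolution is too coarse
    }
    const double total_bytes = static_cast<double>(bytes_per_iteration) * iterations;
    return total_bytes / elapsed_seconds / 1.0e6;
}

// Hypothetical helper: runs `transfer` the given number of times and returns
// the elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename Fn>
double time_iterations(Fn&& transfer, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        transfer();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

In the host code, the callable passed to time_iterations would wrap the two sync calls on out1 and in2, and vector_size_bytes would be passed as bytes_per_iteration.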
Note
This tutorial only shows data transfer from the source device to the destination device. After the data reaches the destination device’s global memory, a kernel can be executed on the destination device to use the transferred data. However, the second kernel execution on the destination device is not shown in this tutorial.
Finally, the test case checks the content of the destination buffer (in2) for correctness.
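A correctness check of this kind might look like the sketch below. It assumes the expected content of in2 is the source input plus the scalar, element by element; verify_transfer is a hypothetical name, and the actual check in host.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: returns true when every element read back from in2
// equals the corresponding element of in1 plus the scalar.
bool verify_transfer(const std::vector<int>& in1, const std::vector<int>& in2,
                     int scalar) {
    if (in1.size() != in2.size()) {
        return false;  // a size mismatch can never be a correct transfer
    }
    for (std::size_t i = 0; i < in1.size(); ++i) {
        if (in2[i] != in1[i] + scalar) {
            return false;
        }
    }
    return true;
}
```

With real buffers, the in2 data would first be read back from the destination device before being compared this way.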