First, we walk through the data-transfer flow using the non-p2p version of the host code, host.cpp.
The source device runs a simple increment kernel that takes an input buffer and produces an output buffer:
out1 = in1 + scalar
The input buffer of the destination device is in2. We want the data to be transferred from out1 to in2. For that purpose, the mapped pointer of out1, out1_map, is used to create the buffer in2.
// Output buffer of the device1
auto out1 = xrt::bo(device1, vector_size_bytes, xrt::bo::flags::normal, krnl.group_id(1));
//Mapped pointer
auto out1_map = out1.map<int*>();
//Input buffer of device2, created using mapped-pointer of device1's output buffer
auto in2 = xrt::bo(device2, out1_map, vector_size_bytes, 0);
After the kernel on the source device finishes its execution, the out1 buffer is updated inside the source device’s global memory. To move that data into the destination device’s global memory, the following two DMA transfers are needed:
1. DMA from source device global memory to host memory, by performing a device-to-host sync on out1
2. DMA from host memory to destination device global memory, by performing a host-to-device sync on in2
out1.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
in2.sync(XCL_BO_SYNC_BO_TO_DEVICE);
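To make the two-hop path concrete, here is a minimal host-only sketch that models it without any XRT calls. SimBuffer and staged_transfer are hypothetical names invented for this illustration; they are not part of XRT. The sketch assumes, as in the snippet above, that both buffers share the same host backing store, so the data crosses host memory once on the way out of the source device and once on the way into the destination device.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a buffer object (not an XRT type): a region of
// simulated device global memory plus a pointer to its host backing store.
struct SimBuffer {
    std::vector<int> device_mem;  // simulated device global memory
    int* host_ptr;                // host backing store (not owned)

    SimBuffer(int* host, std::size_t count)
        : device_mem(count, 0), host_ptr(host) {}

    // Models a device-to-host sync: device memory -> host backing store.
    void sync_from_device() {
        std::memcpy(host_ptr, device_mem.data(),
                    device_mem.size() * sizeof(int));
    }

    // Models a host-to-device sync: host backing store -> device memory.
    void sync_to_device() {
        std::memcpy(device_mem.data(), host_ptr,
                    device_mem.size() * sizeof(int));
    }
};

// Moves data from src's device memory to dst's device memory through the
// host backing store that both buffers share. Returns the number of copies.
inline int staged_transfer(SimBuffer& src, SimBuffer& dst) {
    src.sync_from_device();  // hop 1: source device -> host
    dst.sync_to_device();    // hop 2: host -> destination device
    return 2;
}
```

In this model, constructing two SimBuffer objects over the same host array mirrors creating in2 from the mapped pointer of out1. The point of the sketch is only that the non-p2p path always costs two copies through host memory.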
Note 1: For demonstration purposes only, this testcase measures the throughput of the two sync operations above by executing them a number of times. Later in this tutorial you will see a similar throughput calculation in the p2p version of the design. The throughput number varies with the buffer size (this tutorial is not meant to demonstrate p2p performance, so it uses a small buffer). It can also vary widely depending on several hardware aspects:
- The PCIe slots used for the two cards, and whether the cards are under the same switch
- The CPU architecture of the server, specifically how PCIe transactions are routed between root ports on the CPU buses
- Whether DMA read or DMA write is used to transfer the p2p buffer content
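The throughput measurement described in Note 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from host.cpp: throughput_mb_per_s and measure_throughput are hypothetical helper names, 1 MB is taken as 10^6 bytes, and the transfer is passed in as a callable that would wrap the two sync calls in a real host program.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Converts a byte count moved in `seconds` into MB/s.
// Assumption: 1 MB = 1e6 bytes. Returns 0 for a non-positive duration.
inline double throughput_mb_per_s(std::size_t total_bytes, double seconds) {
    if (seconds <= 0.0) return 0.0;
    return static_cast<double>(total_bytes) / 1.0e6 / seconds;
}

// Runs `transfer` `iterations` times and returns the average throughput,
// given how many bytes one call to `transfer` moves.
inline double measure_throughput(const std::function<void()>& transfer,
                                 std::size_t bytes_per_iteration,
                                 int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        transfer();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;
    const std::size_t total =
        bytes_per_iteration * static_cast<std::size_t>(iterations);
    return throughput_mb_per_s(total, elapsed.count());
}
```

Averaging over many iterations smooths out per-call overhead, which matters most for small buffers like the one used here.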
Note 2: This tutorial only shows the data transfer from the source device to the destination device. Once the data is in the destination device’s global memory, a kernel can be executed on the destination device to consume it; that second kernel execution is not shown in this tutorial.
Finally, the testcase checks the content of the destination buffer (in2) for correctness.
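One way such a check could look is sketched below. It assumes the increment kernel adds the scalar to every element of in1, which is how out1 = in1 + scalar is read here; first_mismatch is a hypothetical helper name, and the actual check in host.cpp may be structured differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compares the data received on the destination side against the expected
// result in1[i] + scalar (assumed element-wise). Returns the index of the
// first mismatching element, or -1 if every element matches. If the sizes
// differ and the common prefix matches, returns the shorter length.
inline long first_mismatch(const std::vector<int>& in1, int scalar,
                           const std::vector<int>& received) {
    const std::size_t common = std::min(in1.size(), received.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (received[i] != in1[i] + scalar) {
            return static_cast<long>(i);
        }
    }
    if (in1.size() != received.size()) {
        return static_cast<long>(common);
    }
    return -1;
}
```

Returning the index rather than a plain boolean makes it easier to report where a transfer went wrong.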