1. Understanding the original (non-p2p) version of the host code - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID
XD099
Release Date
2022-12-01
Version
2022.2 English

First, we walk through the flow of data transfer using the non-p2p version of the host code, host.cpp.

A simple increment kernel runs on the source device. It takes an input buffer and generates an output buffer.

out1 = in1 + scalar

The input buffer of the destination device is in2. We want the data to be transferred from out1 to in2. For that purpose, the mapped pointer of out1, out1_map, is used to create the buffer in2, so that both buffers are backed by the same host memory.

Fig 1: Dataflow without p2p from source device to destination device using two buffers

// Output buffer of device1
auto out1 = xrt::bo(device1, vector_size_bytes, xrt::bo::flags::normal, krnl.group_id(1));
// Host pointer mapped to out1
auto out1_map = out1.map<int*>();
// Input buffer of device2, created using the mapped pointer of device1's output buffer
auto in2 = xrt::bo(device2, out1_map, vector_size_bytes, 0);

After the kernel on the source device finishes its execution, the out1 buffer is updated in the source device’s global memory. To transfer the data into the destination device’s global memory, the following two DMA transfers are needed:

  • DMA from source device global memory to host memory by performing a device-to-host sync on out1

  • DMA from host memory to destination device global memory by performing a host-to-device sync on in2

out1.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
in2.sync(XCL_BO_SYNC_BO_TO_DEVICE);
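The two-hop path above can be illustrated with a toy model in plain C++. This is only a sketch and does not use the XRT API: SimBuffer, two_hop_transfer, and their members are hypothetical names introduced for illustration, and std::copy stands in for the DMA transfers. The point it shows is that out1 and in2 share one host-side backing store, so a device-to-host sync followed by a host-to-device sync carries the data from one device memory to the other.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer object (NOT the XRT API): device-side storage
// plus a pointer to a host-side backing store that may be shared.
struct SimBuffer {
    std::vector<int> device_mem;  // models the device's global memory
    int* host_ptr;                // models the host memory behind the buffer
    std::size_t count;            // number of int elements

    SimBuffer(int* host, std::size_t n)
        : device_mem(n, 0), host_ptr(host), count(n) {}

    // Models a device-to-host sync: device global memory -> host memory.
    void sync_from_device() {
        std::copy(device_mem.begin(), device_mem.end(), host_ptr);
    }

    // Models a host-to-device sync: host memory -> device global memory.
    void sync_to_device() {
        std::copy(host_ptr, host_ptr + count, device_mem.begin());
    }
};

// Runs the modeled flow and returns what ends up in the destination
// device's memory.
std::vector<int> two_hop_transfer(const std::vector<int>& in1, int scalar) {
    std::vector<int> host_mem(in1.size(), 0);          // shared host memory
    SimBuffer out1(host_mem.data(), host_mem.size());  // source device buffer
    SimBuffer in2(host_mem.data(), host_mem.size());   // built on the same host pointer

    // Models the increment kernel on the source device: out1 = in1 + scalar.
    for (std::size_t i = 0; i < in1.size(); ++i) {
        out1.device_mem[i] = in1[i] + scalar;
    }

    out1.sync_from_device();  // hop 1: source device -> host
    in2.sync_to_device();     // hop 2: host -> destination device
    return in2.device_mem;
}
```

In this model, skipping either sync leaves the destination memory at its initial zeros, which mirrors why both transfers are required in the non-p2p flow.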

Note 1: For demonstration purposes only, this testcase measures the throughput of the two sync operations above by executing them a number of times. Later in this tutorial you will see a similar throughput calculation in the p2p version of the design. The throughput number can vary depending on the buffer size (the purpose of this tutorial is not to demonstrate p2p performance, so it uses a small buffer). The throughput number can also vary widely depending on several hardware aspects:

  • The PCIe slots used for the two cards, and whether the cards are under the same switch

  • CPU architecture of the server, specifically how PCIe transactions are routed between root ports on the CPU buses

  • Whether DMA Read or DMA Write is used to transfer the p2p buffer content
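The kind of throughput measurement described in Note 1 could look roughly like the sketch below. It is not taken from the tutorial's host code: the helper name measure_throughput_mb_per_s, its parameters, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration. The transfer is passed in as a callable so that the timing loop stays independent of XRT.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper: calls `transfer` `iterations` times and returns the
// average throughput in MB/s (1 MB = 1e6 bytes, an assumption of this sketch).
// Returns 0.0 if the elapsed time is too short to measure.
double measure_throughput_mb_per_s(const std::function<void()>& transfer,
                                   std::size_t bytes_per_call,
                                   int iterations) {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        transfer();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;

    if (elapsed.count() <= 0.0) {
        return 0.0;  // avoid dividing by zero on a very coarse clock
    }
    const double total_bytes =
        static_cast<double>(bytes_per_call) * static_cast<double>(iterations);
    return total_bytes / elapsed.count() / 1.0e6;
}
```

In the host code, the callable would wrap the two sync calls shown earlier (out1.sync followed by in2.sync), with bytes_per_call set to vector_size_bytes. Whether to count the payload once or once per hop is a reporting choice, so the same convention should be used when comparing against the p2p version.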

Note 2: This tutorial only shows the data transfer from the source device to the destination device. After the data reaches the destination device’s global memory, a kernel can be executed on the destination device to use the transferred data. However, this second kernel execution on the destination device is not shown in this tutorial.

Finally, the testcase checks the content of the destination buffer (in2) for correctness.
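A minimal sketch of such a correctness check is shown below. It assumes the host already has the original input (in1), the destination contents (in2), and the scalar available as plain arrays and values; the function name count_mismatches is hypothetical and the actual check in host.cpp may be written differently.

```cpp
#include <cstddef>

// Hypothetical check: counts the elements where the destination data does
// not equal the expected kernel result in1[i] + scalar.
std::size_t count_mismatches(const int* in1, const int* in2,
                             std::size_t count, int scalar) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (in2[i] != in1[i] + scalar) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of zero means every element of the destination buffer matches the expected result.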