Step 1: Understanding the Original (non-p2p) Version of the Host Code - 2023.1 English

Vitis Tutorials: Hardware Acceleration (XD099)


The first step is to understand how data is transferred in the non-p2p version of the host code, host.cpp. The source device runs a simple increment kernel that takes an input buffer and generates an output buffer:

out1 = in1 + scalar

The input buffer of the destination device is in2, and the data in out1 has to be transferred to it. For that purpose, the mapped pointer of out1, out1_map, is used to create the buffer in2, so that both buffers are backed by the same host memory.


Fig 1: Dataflow without p2p from source device to destination device using two buffers

// Output buffer of the device1
auto out1 = xrt::bo(device1, vector_size_bytes, xrt::bo::flags::normal, krnl.group_id(1));
//Mapped pointer
auto out1_map = out1.map<int*>();
//Input buffer of device2, created using mapped-pointer of device1's output buffer
auto in2 = xrt::bo(device2, out1_map, vector_size_bytes, 0);

After the kernel on the source device finishes its execution, the out1 buffer is updated inside the source device’s global memory. To transfer the data into the destination device’s global memory, the following DMA transfers are required:

  • DMA from source device global memory to host memory by performing device-to-host sync on out1

  • DMA from host memory to destination device global memory by performing host-to-device sync on in2
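The two hops listed above can be pictured with a small host-only model. This is a minimal sketch, not the tutorial's host.cpp: the struct TwoHopModel and its members are hypothetical names, and each std::vector merely stands in for one memory region. In XRT native C++ the two hops would presumably be out1.sync(XCL_BO_SYNC_BO_FROM_DEVICE) followed by in2.sync(XCL_BO_SYNC_BO_TO_DEVICE); that is an assumption about the host code, not a quote from it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: each vector stands in for one memory region.
// No XRT calls are made here; the copies only mimic the DMA direction.
struct TwoHopModel {
    std::vector<int> dev1_global;  // out1 in source device global memory
    std::vector<int> host_mem;     // host memory backing both out1 and in2
    std::vector<int> dev2_global;  // in2 in destination device global memory

    explicit TwoHopModel(std::size_t n)
        : dev1_global(n), host_mem(n), dev2_global(n) {}

    // Hop 1: models the device-to-host sync on out1.
    void sync_out1_from_device() {
        assert(dev1_global.size() == host_mem.size());
        std::copy(dev1_global.begin(), dev1_global.end(), host_mem.begin());
    }

    // Hop 2: models the host-to-device sync on in2.
    void sync_in2_to_device() {
        assert(host_mem.size() == dev2_global.size());
        std::copy(host_mem.begin(), host_mem.end(), dev2_global.begin());
    }
};
```

The point of the model is that the destination device only sees the data after both hops have run, and that every byte crosses host memory on the way.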



This test case measures the throughput of the two sync operations above by executing them a number of times. A similar throughput calculation for the p2p version of the design appears later in this tutorial. The throughput number can vary depending on the buffer size. Because the purpose of this tutorial is not to demonstrate p2p performance, a small buffer is used. The throughput number can also vary widely depending on several hardware aspects, such as:

  • The PCIe slots used for the two cards, and whether the cards are under the same switch

  • CPU architecture of the server, specifically how PCIe transactions are routed between root ports on the CPU buses

  • Whether DMA Read or DMA Write is used to transfer the p2p buffer content
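The throughput measurement described above amounts to timing a loop and dividing the bytes moved by the elapsed time. The following is a minimal sketch under stated assumptions: the helper names time_iterations and throughput_mb_per_s are hypothetical, 1 MB is taken as 10^6 bytes, and how the tutorial's host.cpp counts bytes per iteration is not shown in this section, so the caller decides what to pass as total_bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Time `iterations` back-to-back calls of `op` and return elapsed seconds.
// In the real test, `op` would perform the two sync operations.
template <typename Op>
double time_iterations(Op op, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Convert bytes moved and elapsed seconds to MB/s (1 MB = 1e6 bytes here).
double throughput_mb_per_s(std::size_t total_bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(total_bytes) / 1.0e6 / seconds;
}
```

A caller would time the loop first, then pass the buffer size multiplied by the iteration count as total_bytes. Averaging over many iterations smooths out per-call overhead, which matters most for the small buffer used here.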


This tutorial only shows data transfer from the source device to the destination device. After the data reaches the destination device’s global memory, a kernel can be executed on the destination device to use the transferred data. However, the second kernel execution on the destination device is not shown in this tutorial.

Finally, the test case checks the content of the destination buffer (in2) for correctness.
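One way to express such a check, given that the kernel computes out1 = in1 + scalar, is sketched below. The function first_mismatch is a hypothetical helper, not taken from host.cpp, and it operates on plain host-side copies of the data rather than on XRT buffer objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first element where in2[i] != in1[i] + scalar,
// or -1 if the destination buffer content matches everywhere.
std::ptrdiff_t first_mismatch(const std::vector<int>& in1, int scalar,
                              const std::vector<int>& in2) {
    assert(in1.size() == in2.size());
    for (std::size_t i = 0; i < in1.size(); ++i) {
        if (in2[i] != in1[i] + scalar) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Returning the failing index rather than a bare pass/fail flag makes it easier to see whether a transfer was truncated or never happened at all.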