Using Host Pointer Buffers

Using Host Pointer Buffers - 2023.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID

UG1393

Release Date

2023-07-17

Version

2023.1 English

Important: Using CL_MEM_USE_HOST_PTR is not recommended for embedded platforms. Embedded platforms require contiguous memory allocation and should use the CL_MEM_ALLOC_HOST_PTR method, as described in Letting XRT Allocate Buffers.

There are two main parts of a cl_mem object: host side pointer and device side pointer. Before the kernel starts its operation, the device side pointer is implicitly allocated on the device side memory (for example, on a specific location inside the device global memory) and the buffer becomes a resident on the device. Using clEnqueueMigrateMemObjects this allocation and data transfer occur upfront, much ahead of the kernel execution. This especially helps to enable software pipelining if the host is executing the same kernel multiple times, because data transfer for the next transaction can happen when kernel is still operating on the previous data set, and thus hide the data transfer latency of successive kernel executions.

The OpenCL framework provides a number of APIs for transferring data between the host and the device. Typically, data movement APIs, such as clEnqueueWriteBuffer and clEnqueueReadBuffer, implicitly migrate memory objects to the device after they are enqueued. They do not guarantee when the data is transferred, and this makes it difficult for the host application to synchronize the movement of memory objects with the computation performed on the data.

AMD recommends using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve the performance. Using this API, memory migration can be explicitly performed ahead of the dependent commands. This allows the host application to preemptively change the association of a memory object, through regular command queue scheduling, to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed, potentially hiding or reducing data transfer latencies. After the event associated with clEnqueueMigrateMemObjects has been marked complete, the host program knows the memory objects have been successfully migrated.

Tip: Another advantage of clEnqueueMigrateMemObjects is that it can migrate multiple memory objects in a single API call. This reduces the overhead of scheduling and calling functions to transfer data for more than one memory object.

The following code shows the use of clEnqueueMigrateMemObjects:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
      
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem dev_mem_ptr = clCreateBuffer(context,  
    				 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    				 sizeof(int) * number_of_words, host_mem_ptr, NULL); 

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr); 

err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0, 
	  NULL, NULL);