User-managed kernels require the use of the XRT native API for the software
application, and are specified as an IP object of the xrt::ip
class. The following is a high-level overview of how to structure
your host application to access user-managed kernels from an .xclbin file.
- Add the following header files to include the XRT native API:
#include "experimental/xrt_ip.h"
#include "xrt/xrt_bo.h"
- experimental/xrt_ip.h: Defines the IP as an object of the xrt::ip class.
- xrt/xrt_bo.h: Lets you create buffer objects in the XRT native API.
- Set up the application environment as described in Specifying the Device ID and Loading the XCLBIN.
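As a minimal sketch of that setup step (the device index 0 and the .xclbin path "vadd.xclbin" are illustrative assumptions, not values from this guide):

```cpp
#include "xrt/xrt_device.h"

// Environment setup sketch: open the accelerator card and load the
// binary. Device index 0 and the file name are assumptions.
auto device = xrt::device(0);                   // open device by index
auto uuid = device.load_xclbin("vadd.xclbin");  // load .xclbin, get its uuid
```

The returned uuid and the device object are exactly what the xrt::ip constructor in the next step requires.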
- The IP object (xrt::ip) is constructed from the xrt::device object, the uuid of the loaded .xclbin, and the name of the user-managed kernel:
//User Managed Kernel = IP
auto ip = xrt::ip(device, uuid, "Vadd_A_B");
- Optionally, for Versal AI Core devices
with AI Engine graph applications, you can also
specify the graph application to load at run time. This process requires a few
sub-tasks as shown:
- Add the required header to the #include statements:
#include <experimental/xrt_aie.h>
- Identify the AI Engine graph from the xrt::device object, the uuid of the loaded .xclbin, and the name of the graph application:
auto my_graph = xrt::graph(device, uuid, "mygraph_top");
- Reset and run the graph application from the software program as needed:
my_graph.reset();
std::cout << STR_PASSED << "my_graph.reset()" << std::endl;
my_graph.run(0);
std::cout << STR_PASSED << "my_graph.run()" << std::endl;
Tip: For more information on building and running AI Engine applications, refer to the AI Engine Tools and Flows User Guide (UG1076).
- Create buffers for the IP arguments:
auto <buf_name> = xrt::bo(<device>,<DATA_SIZE>,<flag>,<bank_id>);
Where the buffer object constructor uses the following fields:
- <device>: The xrt::device object of the accelerator card.
- <DATA_SIZE>: The size of the buffer as defined by the width and quantity of the data.
- <flag>: A flag for creating the buffer objects.
- <bank_id>: The memory bank on the device where the buffer should be allocated for IP access. The memory bank specified must match the corresponding IP port's connection inside the .xclbin file; otherwise you will get bad_alloc when running the application. You can specify the assignment of the kernel argument using the --connectivity.sp command as explained in Mapping Kernel Ports to Memory.
For example:
auto buf_in_a = xrt::bo(device, DATA_SIZE, xrt::bo::flags::normal, 0);
auto buf_in_b = xrt::bo(device, DATA_SIZE, xrt::bo::flags::normal, 0);
Tip: Verify the IP connectivity to determine the specific memory bank, or get this information from the Vitis-generated .xclbin.info file. For example, the following information for a user-managed kernel from the .xclbin could guide the construction of buffer objects in your host code:
Instance:        Vadd_A_B_1
Base Address:    0x1c00000

Argument:        scalar00
Register Offset: 0x10
Port:            s_axi_control
Memory:          <not applicable>

Argument:        A
Register Offset: 0x18
Port:            m00_axi
Memory:          bank0 (MEM_DDR4)

Argument:        B
Register Offset: 0x24
Port:            m01_axi
Memory:          bank0 (MEM_DDR4)
- Get the buffer addresses and transfer data between host and device:
auto a_data = buf_in_a.map<int*>();
auto b_data = buf_in_b.map<int*>();

// Get the buffer physical address
long long a_addr = buf_in_a.address();
long long b_addr = buf_in_b.address();

// Sync Buffers
buf_in_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
buf_in_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);
xrt::bo::map() lets you map the host-side buffer backing pointer to a user pointer. However, before reading from the mapped pointer, or after writing to it, you should call xrt::bo::sync() with the appropriate direction flag for the DMA operation.
After preparing the buffer (buffer creation and sync operations as shown above), you can pass all the necessary information to the IP with direct register write operations.
Important: The xrt::ip object differs from the standard xrt::kernel object: it indicates that XRT does not manage the IP, but it does provide access to read and write the registers. For example, the following code passes the buffer base addresses to the IP through the xrt::ip::write_register() command:
ip.write_register(REG_OFFSET_A, a_addr);
ip.write_register(REG_OFFSET_A + 4, a_addr >> 32);
ip.write_register(REG_OFFSET_B, b_addr);
ip.write_register(REG_OFFSET_B + 4, b_addr >> 32);
- Start the IP execution. Because the IP is user-managed, you can use any number of register writes and reads to start the IP, check its status, and restart it as needed. The following example uses an s_axilite interface to access the control signals in the control register:
uint32_t axi_ctrl = 0;
std::cout << "INFO:IP Start" << std::endl;
axi_ctrl = IP_START;
ip.write_register(CSR_OFFSET, axi_ctrl);

// Wait until the IP is DONE
axi_ctrl = 0;
while ((axi_ctrl & IP_IDLE) != IP_IDLE) {
    axi_ctrl = ip.read_register(CSR_OFFSET);
}
- After the IP execution is finished, you can transfer the data back to the host using the xrt::bo::sync command with the appropriate flag to dictate the buffer transfer direction:
buf_in_b.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
- Optionally, profile the application. Because XRT is not in charge of starting or stopping the kernel, you cannot directly profile the operation of user-managed kernels as you would XRT-managed kernels. However, you can use the user_range and user_event objects as discussed in Custom Profiling of the Host Application to profile elements of the host application. For example, the following code captures the time it takes to write the registers from the host application:
// Write Registers
range.start("Phase 4a", "Write A Register");
ip.write_register(REG_OFFSET_A, a_addr);
ip.write_register(REG_OFFSET_A + 4, a_addr >> 32);
range.end();

range.start("Phase 4b", "Write B Register");
ip.write_register(REG_OFFSET_B, b_addr);
ip.write_register(REG_OFFSET_B + 4, b_addr >> 32);
range.end();
You can observe some aspects of the application and kernel operation in the Vitis analyzer as shown in the following figure.