Data Transfer Interface Considerations - 2024.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2024-07-03
Version: 2024.1 English

In a system design, it is important to correctly specify the mode of data transfer between the accelerator and the host. The following sections provide more details on the design aspects.

Global Memory I/O

VSC supports two modes of transfer between global memory and the accelerator interface, as described in Guidance Macros:

  1. DATA_COPY - drops in a data mover, an efficiently designed RTL IP that automates features such as bursting and data width manipulation on the M_AXI interface to global memory. On the accelerator side, this interface supports both sequential access to data, as in the Vitis HLS ap_fifo interface, and random access to data, as in the Vitis HLS ap_memory interface.
  2. ZERO_COPY - connects the M_AXI interface from the global memory platform port to the accelerator.

Consider the following when defining the compute() function interface:

  1. If the data size (all compute arguments) is too big to fit into the device DDR, split the large alloc_buf into smaller chunks and process them over multiple send iterations.
  2. It is generally recommended to use DATA_COPY with the SEQUENTIAL access pattern, especially when the accelerator code accesses its input data sequentially.
  3. If the accelerator code accesses the data randomly, and the code cannot be modified to access the data sequentially, use the RANDOM access pattern if the data fits into device RAM (BRAM or URAM). Otherwise, use ZERO_COPY.
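
The recommendations above amount to a small decision rule. The following is a minimal sketch of that rule in plain C++, not VSC code: the names TransferChoice and choose_transfer, and the idea of passing an on-chip RAM budget in bytes, are assumptions made for illustration only.

```cpp
#include <cstddef>

// Hypothetical names for illustration; none of these are part of VSC.
enum class TransferChoice { DataCopySequential, DataCopyRandom, ZeroCopy };

// sequential_access: the kernel accesses the argument in order.
// data_bytes:        size of the argument.
// on_chip_bytes:     assumed on-chip RAM budget (BRAM or URAM) for it.
inline TransferChoice choose_transfer(bool sequential_access,
                                      std::size_t data_bytes,
                                      std::size_t on_chip_bytes) {
    if (sequential_access) {
        return TransferChoice::DataCopySequential;  // recommended default
    }
    if (data_bytes <= on_chip_bytes) {
        return TransferChoice::DataCopyRandom;      // fits in device RAM
    }
    return TransferChoice::ZeroCopy;                // random and too large
}
```

In a real design the choice is expressed with the DATA_COPY, ACCESS_PATTERN, and ZERO_COPY macros in the accelerator class; the function only summarizes when each one applies.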

Sequential Access Pattern

For PE port data that the kernel code accesses sequentially, use ACCESS_PATTERN(data, SEQUENTIAL);

VSC needs to know the size of the data at runtime. Specify it through the DATA_COPY macro, for example: DATA_COPY(data, data[numData]);

Both data and numData must be passed as arguments to the kernel. Even if the kernel itself does not use numData, it must still be provided as an argument. In general, the size can be any expression that can be evaluated at runtime in terms of the kernel function arguments. The following example is allowed in the accelerator class declaration, though perhaps not very practical:

DATA_COPY(data, data[m * log(n) + 5]);
...
static void PE_func(int n, int m, float* data); 
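
Because the size expression is evaluated from the kernel arguments, the application must provide at least that many elements. The sketch below evaluates the same expression on the host in plain C++. The helper name data_copy_elems is hypothetical, and truncating the floating-point result toward zero is an assumption; the source does not state how VSC converts the expression to an element count.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical host-side helper mirroring DATA_COPY(data, data[m * log(n) + 5]).
// Assumption: the result is truncated toward zero to get an element count.
inline std::size_t data_copy_elems(int n, int m) {
    assert(n >= 1 && m >= 0);  // keep log(n) defined and the size non-negative
    return static_cast<std::size_t>(m * std::log(n) + 5);
}
```

An application could use such a helper to size the buffer it passes to the accelerator, so that the amount of data provided agrees with the amount the data mover transfers.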

For both input and output data, the DATA_COPY size must exactly match the amount of data the kernel reads or writes. If the size does not match the design, functional issues can occur, such as:

  • A hang at runtime, if the kernel reads more data than the application code provides, or
  • The kernel reading garbage data, if the previous kernel call did not consume all the data from its compute call

To prevent or debug such issues, you can use ZERO_COPY to make sure that the kernel code works properly.
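
Both failure modes can be reproduced with a software model of the stream between the data mover and the kernel. The following is a minimal sketch under stated assumptions: Stream, data_mover_push, and kernel_sum are hypothetical stand-ins, not VSC APIs, and the model returns false where real hardware would block forever.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Stand-in for the stream between the data mover and the kernel (not VSC code).
using Stream = std::queue<int>;

// Model of the data mover: pushes copy_size elements, as the DATA_COPY size dictates.
inline void data_mover_push(Stream& s, const std::vector<int>& src,
                            std::size_t copy_size) {
    assert(copy_size <= src.size());
    for (std::size_t i = 0; i < copy_size; ++i) s.push(src[i]);
}

// Model of the kernel: reads `reads` elements and sums them.
// Returns false when the stream runs dry, where real hardware would hang.
inline bool kernel_sum(Stream& s, std::size_t reads, int& sum) {
    sum = 0;
    for (std::size_t i = 0; i < reads; ++i) {
        if (s.empty()) return false;
        sum += s.front();
        s.pop();
    }
    return true;
}
```

If the mover pushes four elements per call but the kernel reads only three, one element is left behind and becomes the first value seen by the next call. If the kernel reads more than was pushed, the model reports the condition that would hang in hardware.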

Random Access Pattern

If the kernel has to access a compute() function argument, data, in a random fashion, use ACCESS_PATTERN(data, RANDOM);

Tip: The random access pattern requires a local FIFO buffer, whose size is limited by the available BRAMs, which are typically in 32 Kb blocks. The on-chip memory requires as many BRAMs as are needed to hold the user-defined argument.

Therefore, you must ensure that the data fits in the on-chip FPGA memory. The kernel code must declare the data as a fixed-size array, for example:

ACCESS_PATTERN(data, RANDOM);
...
static void PE_func(int n, int m, float data[64]);
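
A rough estimate of the BRAM cost can help decide whether an argument is a reasonable candidate for the random access pattern. The sketch below assumes 32 Kb means 32 * 1024 bits of storage per block, as in the tip above, and ignores width and depth configuration constraints, so it is a lower bound rather than what the tools will report. The names kBramBits and bram_blocks are hypothetical.

```cpp
#include <cstddef>

// Assumption: 32 Kb = 32 * 1024 bits of storage per BRAM block.
constexpr std::size_t kBramBits = 32u * 1024u;

// Lower-bound estimate of the BRAM blocks needed for `count` elements
// of `elem_bytes` bytes each (rounded up to whole blocks).
constexpr std::size_t bram_blocks(std::size_t count, std::size_t elem_bytes) {
    return (count * elem_bytes * 8 + kBramBits - 1) / kBramBits;
}
```

For the float data[64] argument above, assuming a 4-byte float, the estimate is 64 * 4 * 8 = 2048 bits, which fits in a single block.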

If the data is accessed randomly and is too big for the on-chip FPGA memory, use ZERO_COPY(data).

Important: Do not use an ACCESS_PATTERN macro together with ZERO_COPY.

The unit of data transferred between the host and global memory is not necessarily the same as the DATA_COPY size; it is determined by the size argument of the VPP_ACC::alloc_buf() call. This size can be larger than the data needed for a single kernel compute, for example when sending data for multiple compute() calls in one shot. Clustering PCIe data transfers for N calls to compute() can therefore be done as follows:

send_while ... { ...
  clustered_buf = acc::alloc_buf( N * size );
  for (i = 0; i < N; i++) { ...
    acc::compute(&clustered_buf[ i * size ] ...
  }

  1. Allocate the appropriate data buffer size for the N compute calls.
  2. Call compute() N times, where each call indexes into the clustered buffer.
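
The indexing scheme can be checked in plain C++ without the VSC runtime. In the sketch below, compute_stub stands in for acc::compute() and simply sums its slice, and a std::vector stands in for the buffer returned by alloc_buf(). Both are assumptions made so the example runs on a host alone.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one accelerator call; here it just sums `size` integers.
inline int compute_stub(const int* data, std::size_t size) {
    int sum = 0;
    for (std::size_t i = 0; i < size; ++i) sum += data[i];
    return sum;
}

// One buffer of N * size elements serves N compute calls,
// each call working on its own slice of the clustered buffer.
inline std::vector<int> run_clustered(const std::vector<int>& clustered_buf,
                                      std::size_t N, std::size_t size) {
    assert(clustered_buf.size() == N * size);
    std::vector<int> results;
    results.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        results.push_back(compute_stub(clustered_buf.data() + i * size, size));
    }
    return results;
}
```

Each call sees only its own size elements, while the host performs a single allocation and transfer for all N calls.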

Compute Payload Data Type

The compute() data type also determines the data layout in global memory and therefore affects accelerator performance. To allow the kernel to access the data as fast as possible, it is important to choose an appropriate data type for the compute() arguments. For example, if the kernel processes integers and is required to process one integer every clock cycle (an HLS initiation interval of 1, II = 1), then the interface can use an array of integers, such as:

static void compute ( int* A );

Consider another example with the following two coding styles that add four integers in every compute call.

// === Style 1: plain int array ===
// --- acc interface
DATA_COPY(data, data[numData*4]);

// --- application code
int data[numData*4];
...
static void acc::compute(int* data, ... );

// --- 4-cycle kernel code
void PE_func(int* data, int numData, int* out) {
  for (int i = 0; i < numData; i++) {
    int o = i * 4;
    out[i] = data[o+0] + data[o+1] + data[o+2] + data[o+3];
  }
}

// === Style 2: packed struct ===
// --- acc interface
DATA_COPY(data, data[numData]);

// --- application code
struct data_t { int i[4]; };
data_t *data;
...
static void acc::compute(data_t* data, ... );

// --- 1-cycle kernel code with packed data type
void PE_func(data_t* di, int numData, int* out) {
  for (int i = 0; i < numData; i++) {
    data_t data = di[i];
    out[i] = data.i[0] + data.i[1] + data.i[2] + data.i[3];
  }
}

The kernel adds up each group of four integers from the input array. The straightforward implementation, shown first, needs 4 clock cycles for each result, assuming one cycle per memory access. It is better to pack all 4 integers into a single global memory access, as shown second. Therefore, in this case it is recommended to:

  • Use a C-struct to pack all the data and pass to the kernel
  • Ensure the correct data copy size is provided, using DATA_COPY(data, data[numData]);
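
The two coding styles compute the same results; only the memory access granularity differs. This can be confirmed on the host with plain C++ versions of the two kernels. The function names pe_unpacked and pe_packed are hypothetical, and writing each result to out[i] is an assumption about the intended output.

```cpp
#include <cstddef>

struct data_t { int i[4]; };
static_assert(sizeof(data_t) == 4 * sizeof(int),
              "data_t is expected to hold four ints with no padding");

// Unpacked style: four separate int reads per result.
inline void pe_unpacked(const int* data, int numData, int* out) {
    for (int i = 0; i < numData; i++) {
        int o = i * 4;
        out[i] = data[o + 0] + data[o + 1] + data[o + 2] + data[o + 3];
    }
}

// Packed style: one data_t read per result.
inline void pe_packed(const data_t* di, int numData, int* out) {
    for (int i = 0; i < numData; i++) {
        data_t d = di[i];
        out[i] = d.i[0] + d.i[1] + d.i[2] + d.i[3];
    }
}
```

Note that the element count passed to the packed version is numData, not numData * 4, which is why the DATA_COPY size changes with the data type.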

Some other key points to note:

  • Using int* data[4] will not pack the integers and will result in the same hardware as int* data.
  • Typically, packing more than 64 bytes (for a 512-bit M_AXI bus width) will not improve performance, but it should not degrade performance either.
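
One way to see why 64 bytes is the useful upper limit is to count how many bus beats one element occupies. The sketch below assumes a 512-bit M_AXI data bus, as in the note above. The helper beats_per_element and the three payload types are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a 512-bit M_AXI data bus, i.e. 64 bytes per beat.
constexpr std::size_t kBusBytes = 512 / 8;

// Number of bus beats needed to move one element of type T (rounded up).
template <typename T>
constexpr std::size_t beats_per_element() {
    return (sizeof(T) + kBusBytes - 1) / kBusBytes;
}

// Hypothetical payload types used only for illustration.
struct quad_t { std::int32_t i[4];  };  // 16 bytes
struct full_t { std::int32_t i[16]; };  // 64 bytes
struct wide_t { std::int32_t i[32]; };  // 128 bytes
```

A 16-byte or 64-byte struct moves in one beat, while a 128-byte struct needs two, so packing beyond the bus width cannot deliver more data per access.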