In a system design, it is important to correctly specify the mode of data transfer between the accelerator and the host. The following sections provide more details on the design aspects.
Global Memory I/O
VSC allows two modes of transfer between global memory and the accelerator interface, as described in Guidance Macros:
- DATA_COPY drops in a data mover, an efficiently designed RTL IP that automates features such as bursting and data width manipulation on the M_AXI interface to global memory. On the accelerator side, this interface supports both sequential access of data, as in the Vitis HLS ap_fifo interface, and random access of data, as in the Vitis HLS ap_memory interface.
- ZERO_COPY connects the M_AXI interface of the global memory platform port directly to the accelerator.
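For illustration, the following is a minimal sketch of how these two macros might appear in a VPP_ACC-derived accelerator class declaration. The class name my_acc, the port names, the NCU value, and the omitted VSC header include are assumptions, not part of the original example.
class my_acc : public VPP_ACC<my_acc, /*NCU=*/1> {
    // in_data goes through the data mover: sequential access, runtime size
    DATA_COPY(in_data, in_data[numData]);
    ACCESS_PATTERN(in_data, SEQUENTIAL);
    // out_data is accessed directly over M_AXI to global memory
    ZERO_COPY(out_data);
public:
    static void compute(int* in_data, int* out_data, int numData);
    static void my_pe(int* in_data, int* out_data, int numData);
};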
The following considerations are recommended when defining the compute() function interface:
- If the data size (all compute arguments) is too big to fit into the device DDR, split the large alloc_buf into smaller chunks of computation over multiple send iterations (see the sketch after this list).
- It is generally recommended to use DATA_COPY with the SEQUENTIAL access pattern, especially when the accelerator code accesses the input data sequentially.
- If the accelerator code accesses the data randomly and cannot be modified to access it sequentially, use the RANDOM access pattern if the data fits into device RAM (BRAM or URAM); otherwise use ZERO_COPY.
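As a sketch of the chunking approach from the first consideration, the host code below (in the same abbreviated pseudocode style as the clustering example later in this section) allocates and sends one chunk per iteration instead of one large buffer; chunk_buf and chunk_size are hypothetical names:
send_while ... { ...
    // allocate and fill only one chunk per send iteration
    chunk_buf = acc::alloc_buf( chunk_size );
    ...
    acc::compute( chunk_buf, ... );
    // continue until all chunks of the large data set have been sent
}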
Sequential Access Pattern
For a PE port data that is accessed sequentially by the kernel code, use:
ACCESS_PATTERN(data, SEQUENTIAL);
VSC requires the size of the data at runtime. This must be provided through the DATA_COPY macro, for example:
DATA_COPY(data, data[numData]);
Both data and numData have to be passed as arguments to the kernel; even if the kernel does not use numData, it still has to be provided as an argument. In general, numData can be any expression, as long as it can be evaluated at runtime in terms of the kernel function arguments. The following example is allowed in the accelerator class declaration, though perhaps not very practical:
DATA_COPY(data, data[m * log(n) + 5]);
...
static void PE_func(int n, int m, float* data);
For both input and output data, the exact size of the DATA_COPY is important: it has to match exactly the amount of data the kernel reads or writes. If the size does not match, the design can have functional issues such as:
- a hang at runtime, if the kernel reads more data than the application code provides, or
- the kernel reading garbage data, if a previous kernel call did not read all the data from its compute call.
To prevent and debug such issues, you can temporarily use ZERO_COPY to make sure that the kernel code itself works properly.
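For example, the accelerator class declaration might be changed temporarily as sketched below while debugging a size mismatch; the port name data is taken from the examples above, and the commented-out lines represent the normal configuration:
// temporary debug configuration: bypass the data mover so the kernel
// addresses global memory directly and no transfer size is required
// DATA_COPY(data, data[numData]);
// ACCESS_PATTERN(data, SEQUENTIAL);
ZERO_COPY(data);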
Random Access Pattern
If the kernel has to access a compute() function argument, data, in a random fashion, use:
ACCESS_PATTERN(data, RANDOM);
In this case you must ensure that the data fits in the on-chip FPGA memory. The kernel code must declare the data as a statically sized array, for example:
ACCESS_PATTERN(data, RANDOM);
...
static void PE_func(int n, int m, float data[64]);
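A hypothetical kernel body for this declaration is sketched below; the class name my_acc and the indexing expression are only for illustration, the point being that data is indexed non-sequentially out of on-chip memory:
void my_acc::PE_func(int n, int m, float data[64])
{
    float sum = 0;
    for (int i = 0; i < n; i++) {
        // non-sequential index into the on-chip array
        sum += data[(i * m) % 64];
    }
    data[0] = sum;
}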
If the data is accessed randomly and is too big for on-chip FPGA memory, you should use ZERO_COPY(data). Do not use the ACCESS_PATTERN macro together with ZERO_COPY.
The unit amount of data transferred between the host and global memory is not necessarily the same as the DATA_COPY size. It is actually determined by the size argument of the VPP_ACC::alloc_buf() call. This size can be bigger than the data size needed for each kernel compute, for example when sending data for multiple compute() calls in one shot. Thus, clustering PCIe data transfers, say for N calls to compute(), is easily done as follows:
send_while ... { ...
    clustered_buf = acc::alloc_buf( N * size );
    for (i = 0; i < N; i++) { ...
        acc::compute( &clustered_buf[ i * size ], ... );
    }
    ...
}
- Allocate the appropriate data buffer size for the N compute calls
- Call compute() N times, where each call indexes into the clustered buffer
Compute Payload Data Type
The compute() argument data types also determine the data layout in global memory and therefore affect accelerator performance. To allow the kernel to access the data as fast as possible, it is important to choose the appropriate data type for the compute() arguments. For example, if the kernel processes integers and is required to process one integer every clock cycle (HLS II = 1), then the interface can use an array of integers, such as:
static void compute ( int* A );
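A hypothetical processing element matching this interface is sketched below; the function name, the result argument, and the trip count are illustrative, showing a pipelined loop that consumes one integer per clock cycle from a sequentially accessed port:
void my_acc::PE_sum(int n, int* A, int* result)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
        sum += A[i];   // one integer read per cycle
    }
    *result = sum;
}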
Consider another example with the following two coding styles that add four integers in every compute call.
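The original side-by-side listing is not reproduced here; the following sketch illustrates the two styles it compared, using hypothetical names (PE_add4_ints, PE_add4_packed, pkt). The first version fetches one integer per memory access; the second packs four integers into a single global memory access using a C-struct:
// Style 1: one integer per access, roughly four cycles per result
void my_acc::PE_add4_ints(int numData, int* data, int* result)
{
    for (int i = 0; i < numData; i += 4)
        result[i / 4] = data[i] + data[i + 1] + data[i + 2] + data[i + 3];
}

// Style 2: four integers packed into one struct, one access per result
struct pkt { int a, b, c, d; };

void my_acc::PE_add4_packed(int numData, pkt* data, int* result)
{
    for (int i = 0; i < numData; i++) {
        pkt p = data[i];   // single wide access fetches all four integers
        result[i] = p.a + p.b + p.c + p.d;
    }
}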
The kernel adds up every four integers from the input array. The straightforward implementation in the first style needs four clock cycles for each result, assuming one cycle per memory access. It is better to pack all four integers into a single global memory access, as in the second style. Therefore, in this case it is recommended to:
- Use a C-struct to pack all the data and pass to the kernel
- Ensure the correct data copy size is provided, using DATA_COPY(data, data[numData]);
Some other key points to note:
- Using int* data[4] will not pack the integers and will result in the same hardware as just int* data;
- Typically packing more than 64 bytes (for a 512-bit M_AXI bus width) will not improve performance, but should also not degrade performance
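Putting these recommendations together, a hypothetical declaration for the packed port might look as follows; the struct name pkt, the class name my_acc, and the NCU value are placeholders, and numData counts pkt elements rather than individual integers:
struct pkt { int a, b, c, d; };   // 16 bytes packed per global memory access

class my_acc : public VPP_ACC<my_acc, /*NCU=*/1> {
    DATA_COPY(data, data[numData]);    // size given in pkt elements
    ACCESS_PATTERN(data, SEQUENTIAL);
public:
    static void compute(pkt* data, int* result, int numData);
};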