Using Burst Data Transfers

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2021-03-22
Version: 2020.2 English

Transferring data in bursts hides memory access latency and improves both bandwidth usage and the efficiency of the memory controller. Check the HLS report for information on the bursts that were inferred.

Recommended: Structure the code so that burst transfers can be inferred from successive requests for data at consecutive address locations. Refer to Reading and Writing by Burst for more details.
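As a rough illustration of the access patterns involved, the following sketch contrasts a loop that requests consecutive addresses with one that requests strided addresses. The function names, the element type int, and the size N are hypothetical and chosen only for this example; whether a burst is actually inferred depends on the tool and should be confirmed in the HLS report.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;  // hypothetical transfer size

// Successive requests hit consecutive addresses in increasing order.
// This is the kind of loop from which a burst can typically be inferred.
void copy_sequential(const int* in, int* local) {
    for (std::size_t i = 0; i < N; ++i) {
        local[i] = in[i];
    }
}

// Successive requests skip over addresses (stride > 1), so they are not
// consecutive and are likely to be issued as separate transactions.
void copy_strided(const int* in, int* local, std::size_t stride) {
    for (std::size_t i = 0; i < N; ++i) {
        local[i] = in[i * stride];
    }
}
```

Both functions are functionally valid; the difference is only in how the memory requests can be grouped.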

If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:

Figure 1. Burst Data Transfer with Detailed Kernel Trace

In the previous figure, you can also see that the memory data transfers following the AXI interconnect are implemented differently, with a shorter transaction time. If you hover over these transactions, you can see that the AXI interconnect has packed the 16 x 4 byte transaction into a single transaction of 1 x 64 bytes. This uses the AXI4 bandwidth more effectively. The next section focuses on this optimization technique in more detail.
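The arithmetic behind this packing can be checked with a small helper. This is only a sketch: the function name transfers_needed is hypothetical, and the 64-byte data path width is an assumption taken from the "1 x 64 bytes" figure above.

```cpp
#include <cstddef>

// Number of data transfers needed to move `count` elements of `elem_bytes`
// each over a data path that is `bus_bytes` wide, rounded up.
constexpr std::size_t transfers_needed(std::size_t elem_bytes,
                                       std::size_t count,
                                       std::size_t bus_bytes) {
    return (elem_bytes * count + bus_bytes - 1) / bus_bytes;
}

// 16 elements of 4 bytes each on a 4-byte-wide path: one transfer per element.
static_assert(transfers_needed(4, 16, 4) == 16, "unpacked: 16 transfers");

// The same 64 bytes on an assumed 64-byte-wide path fit in a single transfer.
static_assert(transfers_needed(4, 16, 64) == 1, "packed: 1 transfer");
```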

Burst inference is heavily dependent on coding style and access pattern. However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:

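void kernel(T in[1024], T out[1024]) {
    T tmpIn[1024];   // internal copy of the input
    T tmpOut[1024];  // internal buffer for the result
    read(in, tmpIn);         // AXI input -> tmpIn
    process(tmpIn, tmpOut);  // computation on internal variables only
    write(tmpOut, out);      // tmpOut -> AXI output
}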

In short, the function read reads from the AXI input into an internal variable (tmpIn). The function process implements the computation, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output. For more information on burst transfers, see the Vitis High-Level Synthesis User Guide (UG1399).
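The bodies of read, process, and write are not given here. A minimal sketch of what they might look like follows, assuming T is int, the size is 1024, and the computation simply doubles each value. All three bodies are placeholders for illustration, not the implementation used by any particular kernel.

```cpp
#include <cstddef>

typedef int T;             // assumption: element type
constexpr int N = 1024;    // matches the array size in the kernel signature

// Simple sequential loop over the AXI input: consecutive addresses only.
// In Vitis HLS, such a loop would commonly also carry a pipeline directive.
void read(const T in[N], T tmpIn[N]) {
    for (int i = 0; i < N; ++i) {
        tmpIn[i] = in[i];
    }
}

// Placeholder computation that touches only the internal variables.
void process(const T tmpIn[N], T tmpOut[N]) {
    for (int i = 0; i < N; ++i) {
        tmpOut[i] = 2 * tmpIn[i];
    }
}

// Simple sequential loop over the AXI output: consecutive addresses only.
void write(const T tmpOut[N], T out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = tmpOut[i];
    }
}

void kernel(T in[N], T out[N]) {
    T tmpIn[N];
    T tmpOut[N];
    read(in, tmpIn);
    process(tmpIn, tmpOut);
    write(tmpOut, out);
}
```

Keeping the read and write loops free of any computation is the point of the structure: each loop does nothing but walk consecutive addresses.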

Isolating the read and write functions from the computation results in:

Simple control structures (loops) in the read and write functions, which make burst detection simpler.
Separation of the computational function from the AXI interfaces, which simplifies potential kernel optimization. See Kernel Optimization for more information.
Internal variables that are mapped to on-chip memory, which allows faster access than AXI transactions. Acceleration platforms supported in the Vitis core development kit can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.