Overview of Burst Transfers
Bursting is an optimization that tries to intelligently aggregate your memory accesses to the DDR to maximize the throughput bandwidth and/or minimize the latency. Bursting is one of many possible optimizations to the kernel. Bursting typically gives you a 4-5x improvement while other optimizations, like access widening or ensuring there are no dependencies through the DDR memory, can provide even bigger performance improvements. Typically, bursting is useful when you have contention on the DDR ports from multiple competing kernels.
The Vitis HLS tool supports automatic burst inference, along with features such as automatic port-widening that facilitate automatic burst access. In cases where automatic burst access has failed, an efficient solution is to rewrite the code or use manual burst as described in Using Manual Burst. If that does not work, another solution might be to use cache memory in the AXI4 interface using the CACHE pragma or directive.
The burst feature of the AXI4 protocol improves the throughput of the load-store functions by reading or writing chunks of data to or from the global memory in a single request. The larger the size of the data, the higher the throughput. This metric is calculated as (number of bytes transferred) × (kernel frequency) / (time), where time is taken to be measured in kernel clock cycles. The maximum kernel interface bitwidth is 512 bits (64 bytes), so a kernel compiled to run at 300 MHz can theoretically achieve 64 bytes × 300 MHz = 19.2 × 10^9 bytes per second for a DDR memory, which is about 17 GBps when a gigabyte is counted as 2^30 bytes.
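The peak-throughput arithmetic above can be checked with a few lines of C++. This is only a sketch: the function name `peak_bytes_per_second` is hypothetical, and it assumes one full-width transfer on every clock cycle with no stalls, which real DDR traffic will not sustain.

```cpp
#include <cstdint>

// Theoretical peak throughput in bytes per second, assuming (optimistically)
// one full-width transfer per clock cycle and no stalls.
constexpr double peak_bytes_per_second(std::uint32_t interface_bits, double clock_hz) {
    return (interface_bits / 8.0) * clock_hz;
}

// 512-bit interface at 300 MHz: 64 bytes per cycle * 300e6 cycles per second.
constexpr double kPeakBytesPerSecond = peak_bytes_per_second(512, 300e6);

// The same figure in binary gigabytes (1 GiB = 2^30 bytes).
constexpr double kPeakGiBPerSecond = kPeakBytesPerSecond / (1024.0 * 1024.0 * 1024.0);
```

The result is 19.2 × 10^9 bytes per second, or just under 17.9 GiB per second, which lines up with the approximate figure quoted above.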
The preceding figure shows how the AXI protocol works. The HLS kernel sends out a read request for a burst of length 8 and then sends a write request burst of length 8. The read latency is defined as the time taken between the sending of the read request burst to when the data from the first read request in the burst is received by the kernel. Similarly, the write latency is defined as the time taken between when data for the last write in the write burst is sent and the time the write acknowledgment is received by the kernel. Read requests are usually sent at the first available opportunity while write requests get queued until the data for each write in the burst becomes available.
The timing of write requests is governed by the syn.interface.m_axi_conservative_mode option, as described in Interface Configuration, which is enabled as the default behavior.
To understand the underlying semantics of burst transfers, consider the following code snippet:
for(size_t i = 0; i < size; i++) {
    out[f(i)] = in[f(i)];
}
Vitis HLS performs automatic burst optimization, which intelligently aggregates the memory accesses inside the loops/functions from the user code and performs read/write to the global memory of a particular size. These read/writes are converted into a read request, write request, and write response to the global memory. Depending on the memory access pattern Vitis HLS automatically inserts these read and write requests either outside the loop bound or inside the loop body. Depending on the placement of these requests, Vitis HLS defines two types of burst requests: sequential burst and pipelined burst.
Burst Semantics
For a given kernel, the HLS compiler implements the burst analysis optimization as a multi-pass optimization, but on a per function basis. Bursting is only done for a function and bursting across functions is not supported. The burst optimizations are reported in the Synthesis Summary report, and missed burst opportunities are also reported to help you improve burst optimization.
At first, the HLS compiler looks for memory accesses in the basic blocks of the function, such as memory accesses in a sequential set of statements inside the function. Assuming the preconditions of bursting are met, each burst inferred in these basic blocks is referred to as sequential burst. The compiler will automatically scan the basic block to build the longest sequence of accesses into a single sequential burst.
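As an illustration of accesses within a single basic block, the following sketch reads and writes four consecutive elements in straight-line code. The kernel name `copy4`, the bundle names, and the `depth` values are hypothetical, and whether the tool actually merges these accesses into one sequential burst depends on the tool version and on the burst preconditions being met.

```cpp
// Hypothetical kernel: consecutive accesses in one basic block (no loop).
// Under the assumptions above, the four reads are candidates for a single
// sequential read burst of length 4, and likewise the four writes.
void copy4(const int *in, int *out) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 depth=4
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 depth=4
    // Reads from consecutive addresses in[0..3].
    int a = in[0];
    int b = in[1];
    int c = in[2];
    int d = in[3];

    // Writes to consecutive addresses out[0..3].
    out[0] = a + 1;
    out[1] = b + 1;
    out[2] = c + 1;
    out[3] = d + 1;
}
```

A standard C++ compiler ignores the HLS pragmas, so the function can be exercised in a plain software test bench before synthesis.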
The compiler then looks at loops and tries to infer what are known as pipeline bursts. A pipeline burst is the sequence of reads/writes across the iterations of a loop. The compiler tries to infer the length of the burst by analyzing the loop induction variable and the trip count of the loop. If the analysis is successful, the compiler can chain the sequences of reads/writes in each iteration of the loop into one long pipeline burst. The compiler today automatically infers a pipeline or a sequential burst, but there is no way to request a specific type of burst. The code needs to be written so as to cause the tool to infer the pipeline or sequential burst.
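A loop written in the following style is the kind of code that gives the compiler a chance to infer a pipeline burst: the address is a simple function of the induction variable `i`, and the trip count is the loop bound `size`. This is a sketch only. The kernel name `scale`, the bundle names, and the `depth` values are hypothetical, and the inference itself is not guaranteed.

```cpp
#include <cstddef>

// Hypothetical kernel: each iteration reads in[i] and writes out[i], so the
// accesses across iterations form one contiguous sequence of length `size`.
void scale(const int *in, int *out, std::size_t size) {
// Separate bundles keep reads and writes on different AXI ports.
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 depth=1024
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 depth=1024
    for (std::size_t i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * 2;
    }
}
```

The Synthesis Summary report is the place to confirm whether a burst was actually inferred for each port.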
Pipeline Burst
Pipeline burst improves the throughput of the functions by reading or writing a large amount of data, up to the maximum, in a single request. The advantage of the pipeline burst is that future requests (i+1) do not have to wait for the current request (i) to finish, because the read request, write request, and write response are outside the loop body and are performed as soon as possible, as shown in the following code example. This significantly improves the throughput of the functions because it takes less time to read or write the data for the whole loop.
rb = ReadReq(i, size);
wb = WriteReq(i, size);
for(size_t i = 0; i < size; i++) {
    Write(wb, i) = f(Read(rb, i));
}
WriteResp(wb);
If the compiler can successfully deduce the burst length from the induction variable (i) and the trip count of the loop (size), it will infer one big pipeline burst and will move the ReadReq, WriteReq and WriteResp calls outside the loop, as shown in the Pipeline Burst code example. So, the read requests for all loop iterations are combined into one read request and all the write requests are combined into one write request. All read requests are typically sent out immediately while write requests are only sent out after the data becomes available.
In this case, the read and write requests for each loop iteration are combined into one read or write request. However, if any of the preconditions of bursting are not met, as described in Preconditions and Limitations of Burst Transfer, the compiler might not infer a pipeline burst but will instead try to infer a sequential burst, where the ReadReq, WriteReq and WriteResp are alongside the read/write accesses being burst optimized, as shown in the Sequential Burst code example.
Sequential Burst
A sequential burst consists of smaller data sizes where the read requests, write requests, and write responses are inside a loop body as shown in the following code example.
for(size_t i = 0; i < size; i++) {
    rb = ReadReq(i, 1);
    wb = WriteReq(i, 1);
    Write(wb, i) = f(Read(rb, i));
    WriteResp(wb);
}
The drawback of sequential burst is that a future request (i+1) depends on the current request (i) finishing because it is waiting for the read request, write request, and write response to complete. This will create gaps between requests as shown in the following figure.
A sequential burst is not as effective as a pipeline burst because it reads or writes a small amount of data many times to cover the loop bounds. Although this has a significant impact on the throughput, sequential burst is still better than no burst. Vitis HLS uses this burst technique if your code does not adhere to the Preconditions and Limitations of Burst Transfer.
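The difference between the two burst types can be made concrete with a deliberately simplified cycle model. Everything here is an assumption for illustration: the function names are hypothetical, each request is charged one fixed round-trip latency, each data beat costs one cycle, and write responses, outstanding-request limits, and memory controller behavior are ignored.

```cpp
#include <cstdint>

// Toy model, pipeline burst: one request covers all n beats, so the
// round-trip latency is paid once and the beats then stream back to back.
constexpr std::uint64_t pipeline_burst_cycles(std::uint64_t n, std::uint64_t latency) {
    return latency + n;
}

// Toy model, sequential burst of length 1 inside the loop: every iteration
// issues its own request and waits for it, so the latency is paid n times.
constexpr std::uint64_t sequential_burst_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * (latency + 1);
}

// Example with an assumed latency of 100 cycles and 1024 elements.
constexpr std::uint64_t kPipelineCycles   = pipeline_burst_cycles(1024, 100);
constexpr std::uint64_t kSequentialCycles = sequential_burst_cycles(1024, 100);
```

Under these assumptions the pipeline burst takes 1124 cycles and the sequential burst takes 103424 cycles. Measured numbers on hardware will differ, but the model shows why the gaps between requests dominate when the latency is paid on every iteration.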
The maximum length of burst requests is controlled by the max_read_burst_length and max_write_burst_length options of the INTERFACE pragma or directive, as discussed in Options for Controlling AXI4 Burst Behavior.
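As a sketch of how these options appear in code, the following kernel sets both maximum burst lengths to 64 on its m_axi ports. The kernel name `copy`, the bundle names, the `depth` values, and the choice of 64 are hypothetical. The helper `num_requests` is also hypothetical: it estimates how many requests a long transfer needs when each request is capped at the maximum burst length, ignoring other constraints such as address alignment.

```cpp
#include <cstddef>

// Estimated number of AXI requests when a transfer of total_beats is split
// into requests of at most max_burst_length beats (ceiling division).
constexpr std::size_t num_requests(std::size_t total_beats, std::size_t max_burst_length) {
    return (total_beats + max_burst_length - 1) / max_burst_length;
}

// Hypothetical kernel with capped read and write burst lengths.
void copy(const int *in, int *out, std::size_t size) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 depth=1024 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 depth=1024 max_write_burst_length=64
    for (std::size_t i = 0; i < size; i++) {
        out[i] = in[i];
    }
}
```

With a cap of 64 beats, a 1024-beat transfer would need at least 16 requests under this estimate.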