Burst transfers improve kernel I/O throughput by reading or writing large chunks of data to the global memory. The larger the burst, the higher the throughput, which is calculated as ((# of bytes transferred) * (kernel frequency) / (Time)). The maximum kernel interface bitwidth is 512 bits, so a kernel compiled at 300 MHz has a theoretical peak of 512 bits * 300 MHz = 19.2 GBps; at 80-95% DDR efficiency this gives roughly 15-18 GBps for a DDR. As explained, Vitis HLS performs automatic burst optimization, which aggregates the memory accesses of the loops and functions in the user code and performs a read or write of a particular size in a single burst request. However, burst transfer also has requirements that can sometimes be overburdening or difficult to meet, as discussed in Preconditions and Limitations of Burst Transfer.
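The peak-bandwidth arithmetic for a 512-bit interface at 300 MHz can be checked with a few lines of C++. This is only a back-of-the-envelope sketch: peak_gbytes_per_sec and effective_gbytes_per_sec are hypothetical helpers, not part of any Vitis library, and the efficiency argument is simply the 80-95% range quoted for DDR.

```cpp
#include <cassert>

// Hypothetical helper (not a Vitis API): peak bandwidth in GB/s for an
// interface that moves `bitwidth` bits on every cycle at `freq_mhz` MHz.
double peak_gbytes_per_sec(unsigned bitwidth, double freq_mhz) {
    assert(bitwidth % 8 == 0);               // whole bytes per beat
    double bytes_per_cycle = bitwidth / 8.0; // 512 bits -> 64 bytes
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9;
}

// Scale the peak by an assumed memory efficiency (0.80 to 0.95 for DDR).
double effective_gbytes_per_sec(unsigned bitwidth, double freq_mhz,
                                double efficiency) {
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return peak_gbytes_per_sec(bitwidth, freq_mhz) * efficiency;
}
```

For a 512-bit interface at 300 MHz, the peak is 19.2 GB/s, and the 80% and 95% efficiency points come out at about 15.4 and 18.2 GB/s.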
In some cases, where automatic burst access has failed, an efficient solution is to rewrite the code or use manual burst. In such cases, if you are familiar with the AXI4 m_axi protocol and understand hardware transaction modeling, you can implement manual burst transfers using the hls::burst_maxi class as described below. Refer to Vitis-HLS-Introductory-Examples/Interface/Memory/manual_burst on GitHub for examples of these concepts. Another solution might be to use cache memory in the AXI4 interface using the CACHE pragma or directive.
hls::burst_maxi Class
The hls::burst_maxi class provides a mechanism to perform read/write access to the DDR memory. Its methods translate into the corresponding AXI4 protocol transactions, sending and receiving requests on the AXI4 bus signals: AW, AR, WDATA, BVALID, and RDATA. These methods control the burst behavior of the HLS scheduler. The adapter, which receives the commands from the scheduler, is responsible for sending the data to the DDR memory. These requests adhere to the user-specified INTERFACE pragma options, such as max_read_burst_length and max_write_burst_length. The class methods should only be used in the kernel code, and not in the test bench (except for the class constructor as described below).
- Constructors:
  - burst_maxi(const burst_maxi &B) : Ptr(B.Ptr) {}
  - burst_maxi(T *p) : Ptr(p) {}
    Important: The HLS design and test bench must be in different files, because the constructor burst_maxi(T *p) is only available in the C-simulation model.
- Read Methods:
  - void read_request(size_t offset, size_t len);
    This method is used to perform a read request to the m_axi adapter. The function returns immediately if the read request queue inside the m_axi adapter is not full; otherwise it waits until space becomes available.
    - offset: Specify the memory offset from which to read the data
    - len: Specify the scheduler burst length. This burst length is sent to the adapter, which can then convert it to the standard AXI AMBA protocol
  - T read();
    This method is used to transfer the data from the m_axi adapter to the scheduler FIFO. If the data is not available, read() blocks. The read() method should be called len number of times, as specified in the read_request().
- Write Methods:
  - void write_request(size_t offset, size_t len);
    This method is used to perform a write request to the m_axi adapter. The function returns immediately if the write request queue inside the m_axi adapter is not full.
    - offset: Specify the memory offset into which the data should be written
    - len: Specify the scheduler burst length. This burst length is sent to the adapter, which can then convert it to the standard AXI AMBA protocol
  - void write(const T &val, ap_int<sizeof(T)> byteenable_mask = -1);
    This method is used to transfer data from the internal buffer of the scheduler to the m_axi adapter. It blocks if the internal write buffer is full. The byteenable_mask is used to enable the bytes in the WDATA. By default it enables all the bytes of the transfer. The write() method should be called len number of times, as specified in the write_request().
  - void write_response();
    This method blocks until all write responses are back from the global memory. This method should be called the same number of times as write_request().
Using Manual Burst in HLS Design
In the HLS design, when you find that automatic burst transfers are not working as desired, and you cannot optimize the design as needed, you can implement the read and write transactions using the hls::burst_maxi object. In this case, you will need to modify your code to replace the original pointer argument with burst_maxi as a function argument. These arguments must be accessed by the explicit read and write methods of the burst_maxi class, as shown in the following examples.
void dut(int *A) {
  for (int i = 0; i < 64; i++) {
#pragma HLS pipeline II=1
    ... = A[i];
  }
}
In the modified code below, the pointer is replaced with the hls::burst_maxi<> class objects and methods. In the example, the HLS scheduler puts 4 requests of len 16 from port A to the m_axi adapter. The adapter stores them inside a FIFO, and whenever the AW/AR bus is available it sends the request to the global memory. In the 64 loop iterations, the read() command issues a blocking call that waits for the data to come back from the global memory. After the data becomes available, the HLS scheduler reads it from the m_axi adapter FIFO.
#include "hls_burst_maxi.h"
void dut(hls::burst_maxi<int> A) {
  // Issue 4 burst requests
  A.read_request(0, 16);   // request 16 elements, starting from A[0]
  A.read_request(128, 16); // request 16 elements, starting from A[128]
  A.read_request(256, 16); // request 16 elements, starting from A[256]
  A.read_request(384, 16); // request 16 elements, starting from A[384]
  for (int i = 0; i < 64; i++) {
#pragma HLS pipeline II=1
    ... = A.read(); // Read the requested data
  }
}
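To make the address pattern of the four requests concrete, the sketch below computes which element each successive read() call returns for a list of (offset, len) requests, assuming the data comes back in request order. It is plain C++ that runs without Vitis HLS; Request, read_order, and example_order are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one read_request(offset, len) call.
struct Request {
    std::size_t offset; // first element index of the burst
    std::size_t len;    // number of elements in the burst
};

// Returns the element index delivered by each successive read() call,
// assuming bursts are served in the order they were requested.
std::vector<std::size_t> read_order(const std::vector<Request> &reqs) {
    std::vector<std::size_t> order;
    for (const Request &r : reqs) {
        assert(r.len > 0); // a burst carries at least one element
        for (std::size_t j = 0; j < r.len; ++j) {
            order.push_back(r.offset + j);
        }
    }
    return order;
}

// The four requests from the example: 16 elements each, 128 elements apart.
std::vector<std::size_t> example_order() {
    return read_order({{0, 16}, {128, 16}, {256, 16}, {384, 16}});
}
```

Under that assumption, the first 16 reads return A[0] to A[15], the 17th read returns A[128], and the 64th read returns A[399].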
In example 2 below, the HLS scheduler/kernel puts 2 write requests from port A to the adapter: the first request of len 2 and the second request of len 1. Because the total burst length is 3, it then issues 3 corresponding write commands. The adapter stores these requests inside a FIFO, and whenever the AW and W buses are available it sends the request and data to the global memory. Finally, two write_response commands are used to await the responses for the two write_request calls.
void trf(hls::burst_maxi<int> A) {
  A.write_request(0, 2);
  A.write(x);          // write A[0]
  A.write_request(10, 1);
  A.write(x, 2);       // write A[1] with byte enable 0010
  A.write(x);          // write A[10]
  A.write_response();  // response of write_request(0, 2)
  A.write_response();  // response of write_request(10, 1)
}
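The effect of the byte-enable mask on a single 32-bit write can be modeled in plain C++. The sketch assumes the usual AXI write-strobe convention, where bit i of the mask enables byte lane i of the data word; apply_byte_enable is a hypothetical helper, not part of hls::burst_maxi.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a masked 32-bit write: bit i of `mask` enables
// byte lane i, so only enabled bytes of `old_word` are replaced by `val`.
std::uint32_t apply_byte_enable(std::uint32_t old_word, std::uint32_t val,
                                unsigned mask) {
    assert(mask <= 0xFu); // a 4-byte word has four byte lanes
    std::uint32_t result = old_word;
    for (unsigned lane = 0; lane < 4; ++lane) {
        if (mask & (1u << lane)) {
            std::uint32_t lane_bits = 0xFFu << (8 * lane);
            result = (result & ~lane_bits) | (val & lane_bits);
        }
    }
    return result;
}
```

With the mask value 2 (binary 0010) used in the example, only the second-lowest byte of the stored word changes; the default mask of all ones replaces the whole word.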
Using Manual Burst in C-Simulation
You can pass a regular array to the top function, and the array will be transformed to hls::burst_maxi automatically by the constructor. The burst_maxi(T *p) constructor is only valid for use in the C simulation model.
#include "hls_burst_maxi.h"
void dut(hls::burst_maxi<int> A);
int main() {
  int Array[1000];
  dut(Array);
  ......
}
Using Manual Burst to Optimize Performance
Vitis HLS characterizes two types of burst behaviors: pipeline burst, and sequential burst.
- Pipeline Burst
  Pipeline burst improves throughput by reading or writing the maximum amount of data in a single request. The compiler infers pipeline burst if the read_request, write_request, and write_response calls are outside the loop, as shown in the following code example. In this example, size is a variable that is sent from the test bench.
    int buf[8192];
    in.read_request(0, size);
    for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE II=1
      buf[i] = in.read();
    }
    out.write_request(0, size*NT);
    for (int i = 0; i < NT; i++) {
      for (int j = 0; j < size; j++) {
    #pragma HLS PIPELINE II=1
        int a = buf[j];
        out.write(a);
      }
    }
    out.write_response();
- Sequential Burst
  Sequential burst transfers smaller amounts of data per request, with the read requests, write requests, and write responses placed inside the loop body. The drawback is that each request (i+1) depends on the previous request (i): it must wait for that iteration's read request, write request, and write response to complete, which causes gaps between requests. Sequential burst is not as effective as pipeline burst because it reads or writes a small amount of data many times to cover the loop bounds. Although this limits the improvement to throughput, sequential burst is still better than no burst.
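As an illustration of the difference in code shape between the two burst types, the sketch below writes the same copy both ways against a small software stand-in. BurstModel is a hypothetical class that only mimics the call sequence of hls::burst_maxi on a plain array so that the example runs without Vitis HLS; it models no timing, and in a real kernel the arguments would be hls::burst_maxi<int> ports. The function names and the chunk parameters are likewise assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical software stand-in for hls::burst_maxi<T>. It only mimics the
// call sequence on a plain array and models no timing or AXI signalling.
template <typename T>
class BurstModel {
public:
    explicit BurstModel(T *p) : ptr_(p) {}
    void read_request(std::size_t offset, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) rd_.push_back(offset + i);
    }
    T read() {
        assert(!rd_.empty()); // read() before read_request() is an error
        T v = ptr_[rd_.front()];
        rd_.pop_front();
        return v;
    }
    void write_request(std::size_t offset, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) wr_.push_back(offset + i);
        ++pending_responses_;
    }
    void write(const T &val) {
        assert(!wr_.empty()); // write() before write_request() is an error
        ptr_[wr_.front()] = val;
        wr_.pop_front();
    }
    void write_response() {
        assert(pending_responses_ > 0); // one response per write_request()
        --pending_responses_;
    }
    std::size_t pending_responses() const { return pending_responses_; }

private:
    T *ptr_;
    std::deque<std::size_t> rd_, wr_; // element indices still to transfer
    std::size_t pending_responses_ = 0;
};

// Pipeline burst shape: one request per direction, issued outside the loop.
void copy_pipeline(BurstModel<int> &in, BurstModel<int> &out,
                   std::size_t size) {
    in.read_request(0, size);
    out.write_request(0, size);
    for (std::size_t i = 0; i < size; ++i) {
        out.write(in.read());
    }
    out.write_response();
}

// Sequential burst shape: request, transfer, and response for each small
// chunk all sit inside the loop body, so chunk c+1 waits for chunk c.
void copy_sequential(BurstModel<int> &in, BurstModel<int> &out,
                     std::size_t chunks, std::size_t chunk_len) {
    for (std::size_t c = 0; c < chunks; ++c) {
        in.read_request(c * chunk_len, chunk_len);
        out.write_request(c * chunk_len, chunk_len);
        for (std::size_t j = 0; j < chunk_len; ++j) {
            out.write(in.read());
        }
        out.write_response();
    }
}
```

Both functions copy the same data. The difference in hardware is that the sequential version issues many short bursts with a dependency between loop iterations, while the pipeline version issues one long burst per direction.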
Features and Limitations
- If the m_axi element is a struct:
  - The struct will be packed into a wide int. Disaggregation of the struct is not allowed.
  - The size of the struct must be a power of 2, and should not exceed 1024 bits or the max width specified by the config_interface -m_axi_max_bitwidth command.
- ARRAY_PARTITION and ARRAY_RESHAPE of burst_maxi ports are not allowed.
- You can apply the INTERFACE pragma or directive to hls::burst_maxi, defining an m_axi interface. If the burst_maxi port is bundled with other ports, all ports in this bundle must be hls::burst_maxi and must have the same element type.
    void dut(hls::burst_maxi<int> A, hls::burst_maxi<int> B, int *C, hls::burst_maxi<short> D) {
    #pragma HLS interface m_axi port=A offset=slave bundle=gmem // OK
    #pragma HLS interface m_axi port=B offset=slave bundle=gmem // OK
    #pragma HLS interface m_axi port=C offset=slave bundle=gmem // Bad. C must also be hls::burst_maxi type, because it shares the same bundle 'gmem' with A and B
    #pragma HLS interface m_axi port=D offset=slave bundle=gmem // Bad. D should have 'int' element type, because it shares the same bundle 'gmem' with A and B
    }
- You can use the INTERFACE pragma or directive to specify num_read_outstanding and num_write_outstanding, and max_read_burst_length and max_write_burst_length, to define the size of the internal buffer of the m_axi adapter.
    void dut(hls::burst_maxi<int> A) {
    #pragma HLS interface m_axi port=A num_read_outstanding=32 num_write_outstanding=32 max_read_burst_length=16 max_write_burst_length=16
    }
- The INTERFACE pragma or directive max_widen_bitwidth is not supported, because HLS will not change the bit width of hls::burst_maxi ports.
- You must make a read_request before read, or a write_request before write:
    void dut(hls::burst_maxi<int> A) {
      ... = A.read(); // Bad because read() before read_request(). You can catch this error in C-sim.
      A.read_request(0, 1);
    }
- If the address and lifetime of the read group (read_request() > read()) and the write group (write_request() > write() > write_response()) overlap, the tool cannot guarantee the access order. C-simulation will report an error.
    void dut(hls::burst_maxi<int> A) {
      A.write_request(0, 1);
      A.write(x);
      A.read_request(0, 1);
      ... = A.read(); // What value is read? It is undefined. It could be original A[0] or updated A[0].
      A.write_response();
    }
    void dut(hls::burst_maxi<int> A) {
      A.write_request(0, 1);
      A.write(x);
      A.write_response();
      A.read_request(0, 1);
      ... = A.read(); // this will read the updated A[0].
    }
- If multiple hls::burst_maxi ports are bundled to the same m_axi adapter and their transaction lifetimes overlap, the behavior is unexpected.
    void dut(hls::burst_maxi<int> A, hls::burst_maxi<int> B) {
    #pragma HLS INTERFACE m_axi port=A bundle=gmem depth=10
    #pragma HLS INTERFACE m_axi port=B bundle=gmem depth=10
      A.read_request(0, 10);
      B.read_request(0, 10);
      for (int i = 0; i < 10; i++) {
    #pragma HLS pipeline II=1
        ... = A.read(); // get value of A[0], A[2], A[4] ...
        ... = B.read(); // get value of A[1], A[3], A[5] ...
      }
    }
- Read or write requests and reads or writes in different dataflow processes are not supported. The dataflow checker will report an error: multiple writes in different dataflow processes are not allowed. For example:
    void transfer(hls::burst_maxi<int> A) {
    #pragma HLS dataflow
      IssueRequests(A); // issue multiple write_request() of A
      Write(A);         // multiple writes to A
      GetResponse(A);   // write_response() of A
    }
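The buffer sizing implied by the outstanding and burst-length INTERFACE options in the list above can be estimated with simple arithmetic. The sketch assumes the adapter's internal storage holds num_outstanding bursts of max_burst_length words each; adapter_buffer_bytes is a hypothetical helper, not a Vitis API, and the result is an estimate rather than a reported resource figure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a Vitis API): estimated adapter buffer size in
// bytes, assuming it holds `num_outstanding` bursts of `max_burst_length`
// words, each `word_bytes` wide.
std::size_t adapter_buffer_bytes(std::size_t num_outstanding,
                                 std::size_t max_burst_length,
                                 std::size_t word_bytes) {
    assert(num_outstanding > 0 && max_burst_length > 0 && word_bytes > 0);
    return num_outstanding * max_burst_length * word_bytes;
}
```

For 32 outstanding requests with a burst length of 16 and a 4-byte element, the estimate is 2048 bytes per direction.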
Potential Pitfalls
The following are some concerns you must be aware of when implementing manual burst techniques:
- Deadlock: Improper use of manual burst can lead to deadlocks. Too many read_request calls before the read() commands will cause deadlock, because the read_request loop pushes requests into the read request FIFO, and this FIFO is only emptied after the read from the global memory is completed. The job of the read() command is to read the data from the adapter FIFO and mark the request done, after which the read_request is popped from the FIFO and a new request can be pushed onto it.
    // reads/writes. will deadlock if N is large
    for (int i = 0; i < N; i++) { A.read_request(i * 128, 16); }
    for (int i = 0; i < 16 * N; i++) { ... = A.read(); }
    for (int i = 0; i < N; i++) { p.write_request(i * 128, 16); }
    for (int i = 0; i < N * 16; i++) { p.write(i); }
    for (int i = 0; i < N; i++) { p.write_response(); }
  In the example above, if N is large then the read_request and read FIFO will be full as it tends to N/2. The read request loop would not finish, and the read command loop would not start, which results in deadlock.
  Note: This is also true for write_request() and write() commands.
- AXI protocol violation: There should be an equal number of write requests and write responses. An unequal number of requests and responses leads to an AXI protocol violation.
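The deadlock condition can be explored with a small software model. The sketch below assumes a request FIFO of fixed depth that frees a slot only once all data for the oldest request has been read, following the description above; RequestFifoModel, its depth parameter, and the two driver functions are hypothetical and do not reflect the actual adapter implementation or its real FIFO sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical model of a bounded read-request FIFO. A slot is freed only
// after every element of the oldest request has been consumed by read().
class RequestFifoModel {
public:
    explicit RequestFifoModel(std::size_t depth) : depth_(depth) {
        assert(depth > 0);
    }

    // Returns false where real hardware would block: the FIFO is full.
    bool try_read_request(std::size_t len) {
        assert(len > 0);
        if (pending_.size() == depth_) return false;
        pending_.push_back(len);
        return true;
    }

    // Consumes one element of the oldest request; frees the slot when done.
    void read() {
        assert(!pending_.empty()); // read() needs an earlier read_request()
        if (--pending_.front() == 0) pending_.pop_front();
    }

private:
    std::size_t depth_;
    std::deque<std::size_t> pending_; // elements left per outstanding request
};

// Pattern from the pitfall: issue all n requests before any read().
// Returns false if a request would block forever (no read() can run).
bool all_requests_first(std::size_t depth, std::size_t n, std::size_t len) {
    RequestFifoModel fifo(depth);
    for (std::size_t i = 0; i < n; ++i) {
        if (!fifo.try_read_request(len)) return false; // stuck: deadlock
    }
    for (std::size_t i = 0; i < n * len; ++i) fifo.read();
    return true;
}

// Bounded pattern: never keep more than `depth` requests in flight by
// draining the oldest burst before issuing a request that would not fit.
bool bounded_requests(std::size_t depth, std::size_t n, std::size_t len) {
    RequestFifoModel fifo(depth);
    std::size_t in_flight = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in_flight == depth) {
            for (std::size_t j = 0; j < len; ++j) fifo.read();
            --in_flight;
        }
        if (!fifo.try_read_request(len)) return false;
        ++in_flight;
    }
    for (std::size_t i = 0; i < in_flight * len; ++i) fifo.read();
    return true;
}
```

In this model, with an assumed depth of 4, issuing 100 requests up front stalls on the fifth request, while the bounded version completes because it drains one burst before each request that would not otherwise fit.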