In the Vitis tool flow, Vitis HLS provides the ability to automatically re-size m_axi interface ports to 512 bits to improve burst access. However, automatic port width resizing only supports standard C data types and does not support aggregate types such as ap_int, ap_uint, struct, or array.
Vitis HLS controls automatic port width resizing using the following two commands:
- config_interface -m_axi_max_widen_bitwidth <N>: Directs the tool to automatically widen bursts on M-AXI interfaces up to the specified bit-width. The value of <N> must be a power of two between 0 and 1024.
- config_interface -m_axi_alignment_byte_size <N>: Assumes that pointers mapped to m_axi interfaces are aligned to at least <N> bytes, where <N> is a power of two. Burst widening requires strong alignment properties, so this assumption helps automatic burst widening.
For example, the following settings allow ports to be widened up to 512 bits, assuming 64-byte alignment:
config_interface -m_axi_max_widen_bitwidth 512
config_interface -m_axi_alignment_byte_size 64
Setting both values to 0 disables automatic port width resizing:
config_interface -m_axi_max_widen_bitwidth 0
config_interface -m_axi_alignment_byte_size 0
Automatic port width resizing only re-sizes the port if the tool can identify a burst access. Therefore, all of the preconditions needed for bursting, as described in AXI Burst Transfers, are also needed for port resizing. These conditions include:
- Accesses must be in monotonically increasing order, both in terms of the memory locations being accessed and in time. You cannot access a memory location that lies between two previously accessed memory locations; in other words, no overlap.
- The access pattern from the global memory should be in sequential order, with the following additional requirements:
- The sequential accesses need to be on a non-vector type.
- The start of the sequential accesses needs to be aligned to the widened word size.
- The length of the sequential accesses needs to be divisible by the widening factor.
The following code example is used in the calculations that follow:
vadd_pipeline:
  for (int i = 0; i < iterations; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_len / c_n max = c_len / c_n
    // Pipelining loops that access only one variable is the ideal way to
    // increase the global memory bandwidth.
  read_a:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      result[x] = a[i * N + x];
    }
  read_b:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      result[x] += b[i * N + x];
    }
  write_c:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      c[i * N + x] = result[x];
    }
  }
The automatic width optimization for the code above is performed in three steps:
- The tool checks the number of access patterns in the read_a loop. There is one access during each loop iteration, so the optimization determines the interface bit-width as 32 = 32 * 1 (bit-width of the int variable * number of accesses).
- The tool tries to reach the maximum specified by config_interface -m_axi_max_widen_bitwidth 512, using the following expression: length = ceil((loop bound of inner loops) * (loop bound of outer loops)) * #(access patterns). In the above code, the outer loop is an imperfect loop, so there will not be burst transfers on the outer loop. Therefore the length only includes the inner loop, and the formula shortens to: length = ceil(loop bound of inner loops) * #(access patterns), or: length = ceil(128) * 32 = 4096, where 128 is the inner-loop bound and 32 is the bit-width determined in the previous step.
- Is the calculated length a power of two? If yes, the width is capped at the value specified by -m_axi_max_widen_bitwidth. Here 4096 is a power of two, so the port is widened to 512 bits.
There are some pros and cons to automatic port width resizing that you should consider when using this feature. It improves read latency from the DDR because the tool reads one wide vector instead of many accesses of the native data-type size. However, it also uses more resources, because the wide vector must be buffered and the data shifted to match the width of the data path.