Options for Controlling AXI4 Burst Behavior - 2023.1 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2023-07-17
Version: 2023.1 English
An optimal AXI4 interface is one in which the design never stalls while waiting to access the bus, and after bus access is granted, the bus never stalls while waiting for the design to read/write. There are many elements of the design that affect the system performance and burst transfer, such as the following:
  • Latency
  • Port Width
  • Multiple Ports
  • Specified Burst Length
  • Number of Outstanding Reads/Writes

Latency

The read latency is defined as the time from when the kernel sends the burst read request to when it receives the data for the first read in the burst. Similarly, the write latency is defined as the time from when the data for the last write in the burst is sent to when the kernel receives the write response. These latencies can be non-deterministic because they depend on system characteristics such as congestion on the DDR access. Because of this, Vitis HLS cannot accurately determine the memory read/write latency during synthesis, and instead uses a default latency of 64 kernel cycles to schedule the requests and operations as follows:

  • It schedules the read/write requests and, while waiting for the data, performs memory-independent operations in parallel, such as working on streams or compute.
  • It waits before scheduling new read/write requests.
Tip: The default tool latency can be changed using the LATENCY pragma or directive.
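
For an m_axi port, one way to set the expected latency is the latency option of the INTERFACE pragma, described under Defining Burst Attributes with the INTERFACE Pragma below. The following is a minimal sketch with hypothetical kernel, port, and bundle names and an assumed latency of 32 cycles:

// Hypothetical kernel: the latency option overrides the default 64-cycle
// estimate for these m_axi ports.
void example(const int *in, int *out, int size) {
#pragma HLS INTERFACE mode=m_axi port=in offset=slave bundle=gmem0 latency=32
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1 latency=32
    for (int i = 0; i < size; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}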

To help you understand the various latencies that are possible in the system, the following figure shows what happens when an HLS kernel sends a burst to the DDR.

Figure 1. Burst Transaction Diagram

When your design makes a read/write request, the request is sent to the DDR through several specialized helper modules. First, the M-AXI adapter serves as a buffer for the requests created by the HLS kernel. The adapter contains logic to cut large bursts into smaller ones (which it must do to prevent hogging the channel, or when a request crosses a 4 KB address boundary, see Vivado Design Suite: AXI Reference Guide (UG1037)), and can also stall the sending of burst requests (depending on the maximum outstanding requests parameter) so that it can safely buffer the entirety of the data for each kernel. This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem. You can configure the M-AXI interface to hold the write request until all data is available using config_interface -m_axi_conservative_mode.

Getting through the adapter will cost a few cycles of latency, typically 5 to 7 cycles. Then, the request goes to the AXI interconnect that routes the kernel’s request to the MIG and then eventually to the DDR. Getting through the interconnect is expensive in latency and can take around 30 cycles. Finally, getting to the DDR and back can cost anywhere from 9 to 14 cycles. These are not precise measurements of latency but rather estimates provided to show the relative latency cost of these specialized modules. For more precise measurements, you can test and observe these latencies using the Application Timeline report for your specific system, as described in AXI Performance Case Study.
Tip: For information about the Application Timeline report, see Application Timeline in the Vitis Unified Software Platform Documentation.

Another way to view the latencies in the system is as follows: the interconnect has an average II of 2 while the DDR controller has an average II of 4-5 cycles on requests (while on the data they are both II=1). The interconnect arbitration strategy is based on the size of read/write requests, and so data requested with longer burst lengths get prioritized over requests with shorter bursts (thus leading to a bigger channel bandwidth being allocated to longer bursts in case of contention). Of course, a large burst request has the side-effect of preventing anyone else from accessing the DDR, and therefore there must be a compromise between burst length and reducing DDR port contention. Fortunately, the large latencies help prevent some of this port contention, and effective pipelining of the requests can significantly improve the bandwidth throughput available in the system.

Latency does not affect loops/functions with pipelined bursts, because the maximum burst size is requested in a single request.
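
For illustration, the following hedged sketch contrasts the two access patterns (the function names, the chunk length of 16, and the offsets argument are hypothetical): the first loop reads a contiguous range that can be requested as one long, pipelined burst, while in the second the chunk base address is data dependent, so each outer iteration issues its own short burst request and pays the request latency again.

// Pipelined burst: one long request over a contiguous range, so the
// request latency is paid only once.
void read_pipelined(const int *in, int *out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}

// Sequential bursts: each chunk starts at a data-dependent address, so a
// separate short burst request is issued (and its latency paid) per chunk.
void read_sequential(const int *in, const int *offsets, int *out, int chunks) {
    for (int c = 0; c < chunks; ++c) {
        int base = offsets[c] * 16;
        for (int i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1
            out[c * 16 + i] = in[base + i];
        }
    }
}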

Latency affects loops/functions with sequential bursts in two possible ways:
  • If the system read/write latency is larger than the default tool latency, Vitis HLS has to wait for the data; in this case, changing the LATENCY pragma or directive will not improve the performance of the system.
  • If the system read/write latency is less than the tool default, then Vitis HLS sits in an idle state for the remaining kernel cycles. This can impact the performance of the design because no tasks are performed during this idle state. As shown in the figure below, the difference between the system latency and the default latency parameter delays each sequential request further in time, causing a significant loss of throughput.
Figure 2. Default Tool Latency

However, when you reduce the tool latency using the LATENCY pragma or directive, Vitis HLS will tightly pack the requests for a sequential burst, as shown in the following figure.

Figure 3. Adjusted Tool Latency

Port Width

The throughput of load-store functions can be further improved by maximizing the number of bytes transferred. Vitis HLS supports kernel ports up to 512 bits wide, which means that a kernel can read or write up to 64 bytes per clock cycle per port.

Vitis HLS also supports automatic port width optimization by analyzing the memory access pattern of the source code. If the code satisfies the preconditions and limitations for burst access, the tool will automatically resize the port to a 512-bit width in the Vitis kernel flow.

Important: If the access size and the number of iterations are not known at compile time, the tool will not automatically widen the port width.

If the tool cannot automatically widen the port, you can manually change the port width by using Vector Data Types or Arbitrary Precision (AP) Data Types as the data type of the port.
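
For example, the following is a minimal sketch of manual widening using a 512-bit ap_uint as the port data type (the kernel, port, and bundle names are hypothetical); each beat then transfers 64 bytes:

#include <ap_int.h>

// Each 512-bit word carries sixteen 32-bit values (64 bytes per beat).
void copy_wide(const ap_uint<512> *in, ap_uint<512> *out, int nwords) {
#pragma HLS INTERFACE mode=m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1
    for (int i = 0; i < nwords; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}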

Multiple Ports

The throughput of load-store functions can be further improved by maximizing concurrent reads/writes. In Vitis HLS, function arguments are bundled (mapped) to a single port by default. Bundling arguments into a single port helps save resources. However, a single port can limit the performance of the kernel because all the memory transfers have to go through it. The m_axi interface has independent READ and WRITE channels, so a single port can read and write simultaneously.

Using multiple ports lets you increase the bandwidth and throughput of the kernel by creating multiple interfaces that connect to different memory banks, as shown in the Multi-DDR tutorial; otherwise, the accesses are sequenced. When multiple arguments access the same memory port or memory bank, an arbiter sequences the concurrent accesses to that port or bank. Having multiple ports connected to different memory banks increases the throughput of the load and store functions. As a result, the compute block should be scaled to match the throughput demand of the load and store functions; otherwise, it will put back-pressure on, or stall, the load-store functions.
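
The following is a minimal sketch with each argument mapped to its own bundle (the argument and bundle names are hypothetical), so each bundle gets its own m_axi adapter and the accesses to a, b, and c can proceed concurrently:

void vadd(const int *a, const int *b, int *c, int n) {
#pragma HLS INTERFACE mode=m_axi port=a offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=m_axi port=b offset=slave bundle=gmem1
#pragma HLS INTERFACE mode=m_axi port=c offset=slave bundle=gmem2
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}

In the Vitis kernel flow, assigning each bundle to a different memory bank is a link-time decision (for example, with the v++ --connectivity.sp option) rather than something done in Vitis HLS.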

Number of Outstanding Reads/Writes

The throughput of load-store functions can be further improved by allowing the system to hide some of the memory latency. The m_axi_num_read_outstanding and m_axi_num_write_outstanding options of the config_interface command, or the num_read_outstanding and num_write_outstanding options of the INTERFACE pragma or directive, let the kernel control the number of pipelined memory requests sent to the global memory without waiting for the previous request to complete.

Increasing the number of pipelined requests increases the pipeline depth of the read/write requests, which will cost additional BRAM/URAM resources.

Note: In most cases where the burst length is >= 16, the default number of outstanding reads/writes should be sufficient. For bursts shorter than 16, AMD recommends doubling the number of outstanding requests from the default of 16, as shown in the sketch below.
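
For example, a minimal sketch applying that recommendation (the kernel name, bundle names, and burst length of 8 are hypothetical):

// Short bursts (length 8), so the number of outstanding requests is doubled
// from the default of 16 to 32 to keep more requests in flight.
void short_bursts(const int *in, int *out, int n) {
#pragma HLS INTERFACE mode=m_axi port=in offset=slave bundle=gmem0 num_read_outstanding=32 max_read_burst_length=8
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1 num_write_outstanding=32 max_write_burst_length=8
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] + 1;
    }
}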

Defining Burst Attributes with the INTERFACE Pragma

To create the optimal AXI4 interface, the following command options are provided in the INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the AXI4 interface.

Note that some of these options can use internal storage to buffer data and this may have an impact on area and resources:

latency
Specifies the expected latency of the AXI4 interface, allowing the design to initiate a bus request several cycles (latency) before the read or write is expected. If this figure is too low, the design will be ready too soon and may stall waiting for the bus. If this figure is too high, bus access may be granted but the bus may stall waiting on the design to start the access. The default latency in Vitis HLS is 64.
max_read_burst_length
Specifies the maximum number of data values read during a burst transfer. Default value is 16.
num_read_outstanding
Specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_read_outstanding*max_read_burst_length*word_size. Default value is 16.
max_write_burst_length
Specifies the maximum number of data values written during a burst transfer. Default value is 16.
num_write_outstanding
Specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_write_outstanding*max_write_burst_length*word_size. Default value is 16.
The following INTERFACE pragma example can be used to help explain these options:
#pragma HLS interface mode=m_axi port=input offset=slave bundle=gmem0 \
    depth=1024*1024*16/(512/8) latency=100 num_read_outstanding=32 num_write_outstanding=32 \
    max_read_burst_length=16 max_write_burst_length=16
  • The interface is specified as having a latency of 100. The HLS compiler seeks to schedule the request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
  • To further improve bus efficiency, the options num_write_outstanding and num_read_outstanding ensure the design contains enough buffering to store up to 32 read and/or write accesses. Each request will require its own buffer. This allows the design to continue processing until the bus requests are serviced. (The resulting buffer size is worked out after this list.)
  • Finally, the options max_read_burst_length and max_write_burst_length ensure the maximum burst size is 16 and that the AXI4 interface does not hold the bus for longer than this. The HLS tool will partition longer bursts according to the specified burst length, and report this condition with a message like the following:
    Multiple burst reads of length 192 and bit width 128 in loop 'VITIS_LOOP_2' (./src/filter.cpp:247:21) has been inferred on port 'mm_read'.
    These burst requests might be further partitioned into multiple requests during RTL generation based on the max_read_burst_length settings.
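
To make the buffering cost concrete, the following sketch evaluates the FIFO sizing formula given above for the read side of this example, assuming the 512-bit data width implied by the depth expression in the pragma:

// num_read_outstanding * max_read_burst_length * word_size
constexpr int num_read_outstanding  = 32;
constexpr int max_read_burst_length = 16;
constexpr int word_size_bytes       = 512 / 8;              // 64 bytes per beat
constexpr int read_fifo_bytes       = num_read_outstanding *
                                      max_read_burst_length *
                                      word_size_bytes;      // 32768 bytes (32 KiB)
// The write side is sized the same way, so this interface implies roughly
// 64 KiB of internal buffering in total.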

Commands to Configure the Burst

These commands configure global settings for the tool to optimize the AXI4 interface for the system in which it will operate. The efficiency of the operation depends on these values being set accurately. The provided default values are conservative, and may require changing depending on the memory access profile of your design.

Table 1. Vitis HLS Controls
config_rtl -m_axi_conservative_mode
Value: bool; default=true
Delay each M-AXI write request until the associated write data are entirely available (typically, buffered into the adapter or already emitted). This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem.

config_interface -m_axi_latency
Value: uint (0 is auto); default=0 for the Vivado IP flow, default=64 for the Vitis Kernel flow
Provide the scheduler with an expected latency for M-AXI accesses. Latency is the delay between a read request and the first read data, or between the last write data and the write response. This number need not be exact: underestimating it produces a lower-latency schedule, but with longer dynamic stalls. The scheduler will account for the additional adapter latency and add a few cycles.

config_interface -m_axi_min_bitwidth
Value: uint; default=8
Minimum bitwidth for the data channels of M-AXI interfaces. Must be a power of two between 8 and 1024. Note that this does not necessarily increase throughput if the actual accesses are smaller than the required interface.

config_interface -m_axi_max_bitwidth
Value: uint; default=1024
Maximum bitwidth for the data channels of M-AXI interfaces. Must be a power of two between 8 and 1024. Note that this does decrease throughput if the actual accesses are bigger than the required interface, as they will be split into a multi-cycle burst of accesses.

config_interface -m_axi_max_widen_bitwidth
Value: uint; default=0 for the Vivado IP flow, default=512 for the Vitis Kernel flow
Allow the tool to automatically widen bursts on M-AXI interfaces up to the chosen bitwidth. Must be a power of two between 8 and 1024. Note that burst widening requires strong alignment properties (in addition to the burst preconditions).

config_interface -m_axi_auto_max_ports
Value: bool; default=false
If the option is false, all the M-AXI interfaces that are not explicitly bundled will be bundled into a single common interface, thus minimizing resource usage (a single adapter). If the option is true, all the M-AXI interfaces that are not explicitly bundled will be mapped to individual interfaces, thus increasing resource usage (multiple adapters).

config_interface -m_axi_alignment_byte_size
Value: uint; default=1 for the Vivado IP flow, default=64 for the Vitis Kernel flow
Assume that top-level function pointers mapped to M-AXI interfaces are aligned to at least the provided width in bytes (a power of two). This can help automatic burst widening. Warning: behavior will be incorrect if the pointers are not actually aligned at runtime.

config_interface -m_axi_num_read_outstanding
Value: uint; default=16
Default value for the M-AXI num_read_outstanding interface parameter.

config_interface -m_axi_num_write_outstanding
Value: uint; default=16
Default value for the M-AXI num_write_outstanding interface parameter.

config_interface -m_axi_max_read_burst_length
Value: uint; default=16
Default value for the M-AXI max_read_burst_length interface parameter.

config_interface -m_axi_max_write_burst_length
Value: uint; default=16
Default value for the M-AXI max_write_burst_length interface parameter.
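
As a hedged illustration of how these global settings interact with the INTERFACE pragma (the kernel, port, and bundle names below are hypothetical): a port that does not specify burst options in the pragma picks up the defaults set with config_interface, while a port that specifies them uses its own per-port values.

// Port "in" relies on the configured defaults (for example, the values set
// with config_interface -m_axi_max_read_burst_length and
// -m_axi_num_read_outstanding). Port "out" overrides them per port.
void top(const int *in, int *out, int n) {
#pragma HLS INTERFACE mode=m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1 num_write_outstanding=32 max_write_burst_length=32
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * 2;
    }
}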