An optimal AXI4 interface is one in which the design never stalls while waiting to access the bus, and after bus access is granted, the bus never stalls while waiting for the design to read/write. There are many elements of the design that affect the system performance and burst transfer, such as the following:
- Latency
- Port Width
- Multiple Ports
- Specified Burst Length
- Number of Outstanding Reads/Writes
Latency
The read latency is defined as the time between sending the burst read request and the kernel receiving the data from the first read request in the burst. Similarly, the write latency is defined as the time between sending the data for the last write in the burst and the kernel receiving the write response. These latencies can be non-deterministic because they depend on system characteristics such as congestion on the DDR memory access. Because of this, the HLS compiler cannot accurately determine the memory read/write latency during synthesis, and so it uses a default latency of 64 kernel cycles to schedule the requests and operations as follows:
- It schedules the read/write requests and, while waiting for the data, performs memory-independent operations in parallel, such as working on streams or compute.
- It waits to schedule new read/write requests.
To help you understand the various latencies that are possible in the system, the following figure shows what happens when an HLS kernel sends a burst to the DDR memory.
When your design makes a read/write request, the request is sent to the DDR memory through several specialized helper modules. First, the m_axi adapter serves as a buffer for the requests created by the HLS kernel. The adapter contains logic to cut large bursts into smaller ones (which it needs to do to prevent hogging the channel or if the request crosses the 4 KB boundary, see Vivado Design Suite: AXI Reference Guide (UG1037)).
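The burst cutting described above can be illustrated with a small software model. This is only a sketch of the idea, not the adapter's actual logic: the `Burst` struct, the `split_burst` function, and its parameters are hypothetical names, and the model assumes the start address is aligned to the port width and that the port width in bytes is a power of two no larger than 4096.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One burst request: start byte address and length in data beats.
struct Burst {
    std::uint64_t addr;
    std::uint32_t beats;
};

// Illustrative model only (hypothetical helper, not the adapter RTL):
// cut a transfer of `beats` data beats starting at `addr` so that no burst
// is longer than `max_beats` and no burst crosses a 4 KB address boundary.
// Assumes `addr` is aligned to `bytes_per_beat` and `max_beats` >= 1.
std::vector<Burst> split_burst(std::uint64_t addr, std::uint32_t beats,
                               std::uint32_t bytes_per_beat,
                               std::uint32_t max_beats) {
    constexpr std::uint64_t kBoundary = 4096;
    std::vector<Burst> out;
    while (beats > 0) {
        // Bytes left before the next 4 KB boundary.
        const std::uint64_t to_boundary = kBoundary - (addr % kBoundary);
        // Whole beats that fit before the boundary (at least one, as a guard).
        const std::uint32_t fit = std::max<std::uint32_t>(
            1, static_cast<std::uint32_t>(to_boundary / bytes_per_beat));
        const std::uint32_t len = std::min({beats, max_beats, fit});
        out.push_back({addr, len});
        addr += static_cast<std::uint64_t>(len) * bytes_per_beat;
        beats -= len;
    }
    return out;
}
```

Under these assumptions, a 4-beat transfer on a 64-byte-wide port starting at address 0xFC0 is cut into a 1-beat burst followed by a 3-beat burst, because the first beat ends exactly at the 4 KB boundary.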
By default, the adapter stalls the sending of burst requests until all data is available (depending on the maximum outstanding requests parameter) so that it can safely buffer the entirety of the data for each kernel. This is done to reduce deadlock due to concurrent requests (read or write) on the memory subsystem, but it can increase write latency. You can disable this hold of write requests by setting syn.interface.m_axi_conservative_mode=false, as described in Interface Configuration, but doing so increases the risk of deadlock.
Another way to view the latencies in the system is as follows: the interconnect has an average II of 2 while the DDR memory controller has an average II of 4-5 cycles on requests (while on the data they are both II=1). The interconnect arbitration strategy is based on the size of read/write requests, and so data requested with longer burst lengths get prioritized over requests with shorter bursts (thus leading to a bigger channel bandwidth being allocated to longer bursts in case of contention). Of course, a large burst request has the side-effect of preventing anyone else from accessing the DDR memory, and therefore there must be a compromise between burst length and reducing DDR port contention. Fortunately, the large latencies help prevent some of this port contention, and effective pipelining of the requests can significantly improve the bandwidth throughput available in the system.
Latency does not affect loops/functions with pipelined bursts because the burst requests the maximum size in a single request.
Latency affects loops/functions with sequential bursts in two possible ways:
- If the system read/write latency is larger than the default tool latency, Vitis HLS has to wait for the data. Changing the latency will not improve the performance of the system.
- If the read/write latency is less than the tool default, then Vitis HLS sits in an idle state and wastes the remaining kernel cycles. This can impact the performance of the design because during this idle state it does not perform tasks. As you can see from the following figure the difference between the system latency and the default latency parameter will cause the sequential requests to be delayed further in time. This causes a significant loss of throughput.
However, when you reduce the tool latency using the LATENCY pragma or directive, the tool will tightly pack the requests for a sequential burst, as shown in the following figure.
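As a minimal sketch of lowering the tool latency, the example below sets the latency option of the INTERFACE pragma on an m_axi port. The function name, argument names, bundle name, and the value 20 are all hypothetical; a suitable value depends on the latency actually observed in your system. Standard C++ compilers ignore the HLS pragma, so the function can also be run as ordinary C++.

```cpp
#include <cstddef>

// Hypothetical load function: copies n words from global memory to a buffer.
void load_words(const int *in, int *local, std::size_t n) {
// latency=20 is an assumed value, chosen here only to show a setting
// below the default of 64; it is not a recommendation.
#pragma HLS interface mode=m_axi port=in offset=slave bundle=gmem0 latency=20
    for (std::size_t i = 0; i < n; ++i) {
        local[i] = in[i];
    }
}
```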
Port Width
The throughput of load-store functions can be further improved by maximizing the number of bytes transferred. The Vitis unified IDE and v++ support kernel ports up to 1024 bits wide, which means that a kernel can read or write up to 128 bytes per clock cycle per port.
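The relation between port width and bytes per cycle can be written out as a short calculation. The helper names below are hypothetical, and the 300 MHz kernel clock is an assumed figure used only to turn bytes per cycle into a peak bandwidth; the result is an upper bound that assumes one transfer on every cycle.

```cpp
#include <cstdint>

// Bytes moved per clock cycle by a port of the given width in bits.
constexpr std::uint32_t bytes_per_cycle(std::uint32_t port_width_bits) {
    return port_width_bits / 8;
}

// Peak bandwidth in bytes per second, assuming one transfer every cycle.
// clock_hz is an assumed kernel clock frequency, not a tool setting.
constexpr std::uint64_t peak_bandwidth(std::uint32_t port_width_bits,
                                       std::uint64_t clock_hz) {
    return static_cast<std::uint64_t>(bytes_per_cycle(port_width_bits)) * clock_hz;
}

// A 1024-bit port moves 128 bytes per cycle; a 512-bit port moves 64.
static_assert(bytes_per_cycle(1024) == 128, "1024 bits = 128 bytes");
static_assert(bytes_per_cycle(512) == 64, "512 bits = 64 bytes");
// At an assumed 300 MHz clock, 128 bytes per cycle is 38.4 GB/s peak.
static_assert(peak_bandwidth(1024, 300000000ULL) == 38400000000ULL,
              "128 B/cycle at 300 MHz");
```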
Vitis HLS also supports automatic port width optimization by analyzing the memory access pattern of the source code. If the code satisfies the preconditions and limitations for burst access, it will automatically resize the port to 512 bit width in the Vitis kernel flow.
If the tool cannot automatically widen the port, you can manually change the port width by using Vector Data Types or Arbitrary Precision (AP) Data Types as the data type of the port.
Multiple Ports and Channels
The throughput of load-store functions can be further improved by maximizing concurrent reads/writes. In Vitis HLS, the function arguments are by default bundled (mapped or grouped) into a single port. Bundling ports into a single port helps save resources. However, a single port can limit the performance of the kernel because all the memory transfers have to go through it. The m_axi interface has independent READ and WRITE channels, so a single port can read and write simultaneously. The m_axi interface also supports multiple channels in a single port, as described in M_AXI Channels, to increase the number of read and write channels in the port.
Using multiple ports lets you increase the bandwidth and throughput of the kernel by creating multiple interfaces to connect to different memory banks, as shown in the Multi-DDR tutorial; otherwise, the accesses are sequential. When multiple arguments access the same memory port or memory bank, an arbiter sequences the concurrent accesses to that port or bank. Having multiple ports connected to different memory banks increases the throughput of the load and store functions. As a result, the compute block should be scaled equally to meet the throughput demand from the load and store functions; otherwise, it will put back-pressure on, or stall, the load-store functions.
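A minimal sketch of assigning arguments to separate ports uses a different bundle name on each INTERFACE pragma. The kernel name `vadd`, the argument names, and the bundle names gmem0 to gmem2 are hypothetical. Connecting each resulting port to a different memory bank is a separate system-level step that is not shown here.

```cpp
#include <cstddef>

// Hypothetical vector-add kernel with one m_axi port per argument.
// Each distinct bundle name requests its own m_axi adapter, so the two
// reads and the write do not have to share a single port.
void vadd(const int *in1, const int *in2, int *out, std::size_t n) {
#pragma HLS interface mode=m_axi port=in1 offset=slave bundle=gmem0
#pragma HLS interface mode=m_axi port=in2 offset=slave bundle=gmem1
#pragma HLS interface mode=m_axi port=out offset=slave bundle=gmem2
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in1[i] + in2[i];
    }
}
```

Giving all three arguments the same bundle name would instead map them to one shared port, which saves resources at the cost of sequential access.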
Number of Outstanding Reads/Writes
The throughput of load-store functions can be further improved by allowing the system to hide some of the memory latency. The m_axi_num_read_outstanding and m_axi_num_write_outstanding options of the config_interface command, or of the INTERFACE pragma or directive, let the kernel control the number of pipelined memory requests sent to the global memory without waiting for the previous request to complete.
Increasing the number of pipelined requests increases the pipeline depth of the read/write requests, which will cost additional BRAM/URAM resources.
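As a rough back-of-the-envelope aid, the number of requests that must be in flight to cover a given latency can be estimated from the latency and the burst length. This is an assumption-based rule of thumb, not a formula used by the tool: it assumes one data beat per cycle and ignores interconnect and memory controller effects. The function name is hypothetical.

```cpp
#include <cstdint>

// Rough estimate (an assumption, not a tool formula): with one data beat
// per cycle, each burst keeps the data channel busy for burst_length cycles,
// so about ceil(latency / burst_length) requests need to be in flight to
// keep data arriving while earlier requests are still being served.
constexpr std::uint32_t outstanding_to_cover(std::uint32_t latency_cycles,
                                             std::uint32_t burst_length) {
    return (latency_cycles + burst_length - 1) / burst_length;
}

// Default latency of 64 cycles with bursts of 16 beats: about 4 requests.
static_assert(outstanding_to_cover(64, 16) == 4, "64 / 16 = 4");
// A latency of 100 cycles with bursts of 16 beats: about 7 requests.
static_assert(outstanding_to_cover(100, 16) == 7, "ceil(100 / 16) = 7");
```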
Defining Burst Attributes with the INTERFACE Pragma
To create the optimal AXI4 interface, the following command options are provided in the INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the AXI4 interface.
Note that some of these options can use internal storage to buffer data and this may have an impact on area and resources:
- latency: Specifies the expected latency of the AXI4 interface, allowing the design to initiate a bus request several cycles (latency) before the read or write is expected. If this figure is too low, the design will be ready too soon and may stall waiting for the bus. If this figure is too high, bus access may be granted but the bus may stall waiting on the design to start the access. Default latency in Vitis HLS is 64.
- max_read_burst_length: Specifies the maximum number of data values read during a burst transfer. Default value is 16.
- num_read_outstanding: Specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_read_outstanding*max_read_burst_length*word_size. Default value is 16.
- max_write_burst_length: Specifies the maximum number of data values written during a burst transfer. Default value is 16.
- num_write_outstanding: Specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_write_outstanding*max_write_burst_length*word_size. Default value is 16.
#pragma HLS interface mode=m_axi port=input offset=slave bundle=gmem0 depth=1024*1024*16/(512/8) latency=100 num_read_outstanding=32 num_write_outstanding=32 max_read_burst_length=16 max_write_burst_length=16
- The interface is specified as having a latency of 100. The HLS compiler seeks to schedule the request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
- To further improve bus efficiency, the options num_write_outstanding and num_read_outstanding ensure the design contains enough buffering to store up to 32 read and/or write bursts. Each request will require its own buffer. This allows the design to continue processing until the bus requests are serviced.
- Finally, the options max_read_burst_length and max_write_burst_length ensure the maximum burst size is 16 and that the AXI4 interface does not hold the bus for longer than this. The HLS tool will partition longer bursts according to the specified burst length, and report this condition with a message like the following:
  Multiple burst reads of length 192 and bit width 128 in loop 'VITIS_LOOP_2'(./src/filter.cpp:247:21)has been inferred on port 'mm_read'. These burst requests might be further partitioned into multiple requests during RTL generation based on the max_read_burst_length settings.
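The FIFO size formula and the burst partitioning can be worked through for the example values. The helper names are hypothetical, and the 64-byte word size is an assumption (a 512-bit port, suggested by the 512/8 term in the depth expression of the example pragma).

```cpp
#include <cstdint>

// FIFO size in bytes implied by num_*_outstanding * max_*_burst_length * word_size.
constexpr std::uint64_t adapter_fifo_bytes(std::uint32_t num_outstanding,
                                           std::uint32_t max_burst_length,
                                           std::uint32_t word_bytes) {
    return static_cast<std::uint64_t>(num_outstanding) * max_burst_length * word_bytes;
}

// Number of requests needed when an inferred burst is longer than the maximum.
constexpr std::uint32_t burst_requests(std::uint32_t burst_length,
                                       std::uint32_t max_burst_length) {
    return (burst_length + max_burst_length - 1) / max_burst_length;
}

// Example pragma values with an assumed 64-byte word: 32 * 16 * 64 = 32768 bytes.
static_assert(adapter_fifo_bytes(32, 16, 64) == 32768, "32 KB per direction");
// Default values (16 outstanding, burst length 16) with the same word size.
static_assert(adapter_fifo_bytes(16, 16, 64) == 16384, "16 KB per direction");
// A burst of length 192 with a maximum burst length of 16 needs 12 requests.
static_assert(burst_requests(192, 16) == 12, "192 / 16 = 12");
```

Doubling num_read_outstanding from 16 to 32 therefore doubles the read buffering in this model, which is the resource cost that the outstanding-request options trade against latency hiding.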
Commands to Configure the Burst
These commands configure global settings for the tool to optimize the AXI4 interface for the system in which it will operate. The efficiency of the operation depends on these values being set accurately. The provided default values are conservative, and might require changing depending on the memory access profile of your design.
Vitis HLS Command | Value | Description |
---|---|---|
syn.interface.m_axi_conservative_mode | bool, default=true | Delay each M-AXI write request until the associated write data are entirely available (typically, buffered into the adapter or already emitted). This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem. |
syn.interface.m_axi_latency | uint, 0 is auto, default=0 (for Vivado IP flow), default=64 (for Vitis Kernel flow) | Provide the scheduler with an expected latency for M-AXI accesses. Latency is the delay between a read request and the first read data, or between the last write data and the write response. Note that this number need not be exact: underestimation makes for a lower-latency schedule, but with longer dynamic stalls. The scheduler will account for the additional adapter latency and add a few cycles. |
syn.interface.m_axi_min_bitwidth | uint, default=8 | Minimum bitwidth for M-AXI interfaces data channels. Must be a power-of-two between 8 and 1024. Note that this does not necessarily increase throughput if the actual accesses are smaller than the required interface. |
syn.interface.m_axi_max_bitwidth | uint, default=1024 | Maximum bitwidth for M-AXI interfaces data channels. Must be a power-of-two between 8 and 1024. Note that this does decrease throughput if the actual accesses are bigger than the required interface, as they will be split into a multi-cycle burst of accesses. |
syn.interface.m_axi_max_widen_bitwidth | uint, default=0 (for Vivado IP flow), default=512 (for Vitis Kernel flow) | Allow the tool to automatically widen bursts on M-AXI interfaces up to the chosen bitwidth. Must be a power-of-two between 8 and 1024. Note that burst widening requires strong alignment properties (in addition to burst). |
syn.interface.m_axi_auto_max_ports | bool, default=false | If the option is false, all the M-AXI interfaces that are not explicitly bundled will be bundled into a single common interface, thus minimizing resource usage (single adapter). If the option is true, all the M-AXI interfaces that are not explicitly bundled will be mapped into individual interfaces, thus increasing the resource usage (multiple adapters). |
syn.interface.m_axi_alignment_byte_size | uint, default=1 (for Vivado IP flow), default=64 (for Vitis Kernel flow) | Assume top function pointers that are mapped to M-AXI interfaces are at least aligned to the provided width in bytes (power of two). This can help automatic burst widening. Warning: behavior will be incorrect if the pointers are not actually aligned at runtime. |
syn.interface.m_axi_num_read_outstanding | uint, default=16 | Default value for M-AXI num_read_outstanding interface parameter. |
syn.interface.m_axi_num_write_outstanding | uint, default=16 | Default value for M-AXI num_write_outstanding interface parameter. |
syn.interface.m_axi_max_read_burst_length | uint, default=16 | Default value for M-AXI max_read_burst_length interface parameter. |
syn.interface.m_axi_max_write_burst_length | uint, default=16 | Default value for M-AXI max_write_burst_length interface parameter. |