Performance
QDMA performance and detailed analysis are available in AR 71453.
AMD provides multiple example designs for you to experiment with. All example designs can be downloaded from GitHub. The performance example design can be selected from the Vivado Store.
The following QDMA register settings are recommended by AMD for better performance. Performance numbers vary depending on the system and OS used.
Address | Name | Fields | Field Value | Register Value |
---|---|---|---|---|
0xB08 | PFCH CFG | | | 0x100_0100 |
0xA80 | PFCH_CFG_1 | | | 0x78_007c |
0xA84 | PFCH_CFG_2 | | | 0x8040_03C8 |
0x1400 | CRDT_COAL_CFG_1 | | | 0x4010 |
0x1404 | CRDT_COAL_CFG_2 | | | 0x78_0060 |
0xE24 | H2C_REQ_THROT_PCIE | | | 0x8E04_E000 |
0xE2C | H2C_REQ_THROT_AXIMM | | | 0x8E05_0000 |
0x250 | QDMA_GLBL_DSC_CFG | | | 0x00_0015 |
0x4C | CONFIG_BLOCK_MISC_CONTROL | | | 0x80_001f |
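As a rough illustration of how the table above might be applied from software, the sketch below keeps the recommended values in one dictionary and writes them through caller-supplied accessors. The names `RECOMMENDED_REGS`, `apply_recommended`, `write_reg`, and `read_reg` are hypothetical and not part of any AMD driver API; the addresses and values are copied from the table.

```python
# Recommended values from the table above: address -> (name, register value).
RECOMMENDED_REGS = {
    0xB08: ("PFCH CFG", 0x100_0100),
    0xA80: ("PFCH_CFG_1", 0x78_007C),
    0xA84: ("PFCH_CFG_2", 0x8040_03C8),
    0x1400: ("CRDT_COAL_CFG_1", 0x4010),
    0x1404: ("CRDT_COAL_CFG_2", 0x78_0060),
    0xE24: ("H2C_REQ_THROT_PCIE", 0x8E04_E000),
    0xE2C: ("H2C_REQ_THROT_AXIMM", 0x8E05_0000),
    0x250: ("QDMA_GLBL_DSC_CFG", 0x00_0015),
    0x4C: ("CONFIG_BLOCK_MISC_CONTROL", 0x80_001F),
}


def apply_recommended(write_reg, read_reg=None):
    """Write every recommended value.

    write_reg(addr, value) and read_reg(addr) are hypothetical accessors
    supplied by the caller (for example, wrappers around a mapped BAR).
    Returns a list of (addr, name) pairs whose readback did not match.
    """
    mismatches = []
    for addr, (name, value) in RECOMMENDED_REGS.items():
        write_reg(addr, value)
        if read_reg is not None and read_reg(addr) != value:
            mismatches.append((addr, name))
    return mismatches


if __name__ == "__main__":
    # Stand-in register file so the sketch runs without hardware.
    regs = {}
    bad = apply_recommended(regs.__setitem__, regs.__getitem__)
    for addr, value in regs.items():
        print(f"0x{addr:04X} <= 0x{value:08X}")
    print("readback mismatches:", bad)
```

Keeping the values in a single table makes it easy to compare them against the register dump of a running system before measuring performance.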
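- QDMA_C2H_INT_TIMER_TICK (0xB0C) set to 50, targeting a 100 ns timer tick. Check this value against your user clock: 50 clock cycles equal 100 ns at 2 ns per cycle (500 MHz), whereas at 4 ns per cycle (250 MHz user clock) 100 ns is 25 cycles.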
- C2H trigger mode set to User + Timer, with the user count set to 64 and the timer set to match the round-trip latency. The global timer register should have a value of 30 for 3 μs.
- TX/RX API burst size = 64, ring depth = 2048. The driver should update TX/RX PIDX in batches of 64.
- PCIe MPS = 256 bytes, MRRS = 4K bytes, 10-bit Tag enabled, Relaxed Ordering enabled.
- The driver updates the completion CIDX in batches of 64 before updating the C2H PIDX, to reduce the number of MMIO writes.
- The driver should update the H2C PIDX in batches of 64, and also update for the last descriptor of the scatter gather list.
- C2H context:
  - bypass = 0 (Internal mode)
  - frcd_en = 1
  - qen = 1
  - wbk_en = 1
  - irq_en = irq_arm = int_aggr = 0
- C2H prefetch context:
  - pfch = 1
  - bypass = 0
  - valid = 1
- C2H CMPT context:
  - en_stat_desc = 1
  - en_int = 0 (Poll mode)
  - int_aggr = 0 (Poll mode)
  - trig_mode = 5
  - user_idx = corresponding to 64
  - timer_idx = corresponding to 3 μs
  - valid = 1
- H2C context:
  - bypass = 0 (Internal mode)
  - frcd_en = 0
  - fetch_max = 0
  - qen = 1
  - wbk_en = 1
  - wbi_chk = 1
  - wbi_intvl_en = 1
  - irq_en = 0 (Poll mode)
  - irq_arm = 0 (Poll mode)
  - int_aggr = 0 (Poll mode)
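The timer-related settings in the list above come down to two divisions: how many user-clock cycles make up one timer tick, and how many ticks make up the completion timeout. The sketch below works through that arithmetic; the function names are illustrative only and do not correspond to any driver API.

```python
def cycles_per_tick(tick_ns, user_clock_mhz):
    """User-clock cycles needed for one timer tick of tick_ns nanoseconds.

    One cycle lasts 1000 / user_clock_mhz nanoseconds, so the cycle count
    is tick_ns * user_clock_mhz / 1000 (integer math avoids float rounding).
    """
    return tick_ns * user_clock_mhz // 1000


def timer_count(timeout_us, tick_ns):
    """Number of timer ticks of tick_ns nanoseconds in timeout_us microseconds."""
    return timeout_us * 1000 // tick_ns


if __name__ == "__main__":
    print(cycles_per_tick(100, 250))  # 100 ns tick with a 4 ns cycle
    print(cycles_per_tick(100, 500))  # 100 ns tick with a 2 ns cycle
    print(timer_count(3, 100))        # 3 us timeout with a 100 ns tick
```

With these helpers, `cycles_per_tick(100, 250)` gives 25, `cycles_per_tick(100, 500)` gives 50, and `timer_count(3, 100)` gives 30, which matches the global timer value of 30 for 3 μs.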
For optimal QDMA streaming performance, packet buffers of the descriptor ring should be aligned to at least 256 bytes.
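One way to meet the 256-byte alignment recommendation is to round each buffer address up to the next 256-byte boundary when carving buffers out of a larger pool. The following is a minimal sketch of that address arithmetic; `align_up`, `is_aligned`, and `carve_buffers` are hypothetical helper names, and the pool base address in the usage lines is made up for illustration.

```python
ALIGN = 256  # recommended minimum packet-buffer alignment, in bytes


def align_up(addr, align=ALIGN):
    """Round addr up to the next multiple of align (align must be a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def is_aligned(addr, align=ALIGN):
    """True if addr is a multiple of align (align must be a power of two)."""
    return (addr & (align - 1)) == 0


def carve_buffers(pool_base, buf_size, count, align=ALIGN):
    """Return count buffer addresses inside a pool, each aligned to align.

    The stride is buf_size rounded up to the alignment, so every buffer
    starts on an aligned boundary once the first one does.
    """
    first = align_up(pool_base, align)
    stride = align_up(buf_size, align)
    return [first + i * stride for i in range(count)]


if __name__ == "__main__":
    # Illustrative pool base that is deliberately not 256-byte aligned.
    for addr in carve_buffers(0x1000_0010, 1500, 4):
        print(hex(addr), is_aligned(addr))
```

Rounding the stride as well as the base means a 1500-byte buffer occupies 1536 bytes of the pool, trading a little memory for aligned starts on every descriptor.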
Performance in Descriptor Bypass Mode
QDMA supports both internal mode and descriptor bypass mode. Select the mode based on the number of active queues the design needs: with fewer than 32 active queues, internal mode works well; with more than 32 queues, descriptor bypass mode is the better choice.
In descriptor bypass mode, it is your responsibility to maintain the descriptors for the corresponding queues and to control the priority in which they are sent back to the IP.
When the design is configured in descriptor bypass mode, all the above settings apply. The following recommendations improve performance in bypass mode:
- When the dma<0/1>_h2c_byp_in_st_sdi port is set, the QDMA IP generates a status writeback for every packet. AMD recommends that this port be asserted once every 32 or 64 packets. If there are no more descriptors left, assert the h2c_byp_in_st_sdi port at the last descriptor. This requirement is on a per-queue basis, and applies to AXI4 (H2C and C2H) bypass transfers and AXI4-Stream H2C transfers.
- For AXI4-Stream C2H Simple bypass mode, the dma<0/1>_dsc_crdt_in_fence port should be set to 1 for performance reasons. This recommendation assumes that your design has already coalesced the credits for each queue and sent them to the IP. In internal mode, set the fence bit in the QDMA_C2H_PFCH_CFG_2 (0xA84) register.
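The status writeback recommendation for bypass mode can be read as a simple per-queue rule: request a writeback on every 32nd or 64th descriptor, and also on the last descriptor when nothing else is queued. The behavioral sketch below models that rule in software; it is not RTL, it assumes one descriptor per packet, and the function name `sdi_flags` is hypothetical.

```python
def sdi_flags(num_descriptors, interval=64):
    """Model which descriptors of one queue would carry the sdi request.

    Returns a list of booleans, one per descriptor: True on every
    interval-th descriptor and on the final descriptor of the batch.
    Assumes one descriptor per packet, which is a simplification.
    """
    flags = []
    for i in range(num_descriptors):
        hits_interval = (i + 1) % interval == 0
        is_last = i == num_descriptors - 1
        flags.append(hits_interval or is_last)
    return flags


if __name__ == "__main__":
    flags = sdi_flags(130, interval=64)
    asserted = [i for i, f in enumerate(flags) if f]
    print("writebacks requested at descriptor indices:", asserted)
    print("writebacks per 130 descriptors:", sum(flags))
```

For 130 descriptors and an interval of 64, the model requests a writeback at indices 63, 127, and 129, so three writebacks replace 130. The same every-N-plus-last pattern matches the recommendation for driver H2C PIDX updates in batches of 64.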