Performance and Resource Utilization

QDMA Subsystem for PCI Express Product Guide (PG302)

Document ID: PG302
Release Date: 2024-12-18
Version: 5.0 English

Performance

A detailed QDMA performance analysis is available in AMD answer record AR 71453.

AMD provides two example designs for you to experiment with. The standard example design is intended for functional testing only. To generate an example design for performance analysis, use the following Tcl command:
set_property CONFIG.performance_exdes {true} [get_ips qdma_0]

The following QDMA register settings are recommended by AMD for better performance. Performance numbers can vary based on the system and OS used.

Table 1. QDMA Performance Registers
Each entry lists the register address, name, and resulting register value, followed by the individual field settings.

0xB08 PFCH_CFG (register value 0x100_0100)
  • evt_pfch_fl_th[15:0] = 256
  • pfch_fl_th[15:0] = 256
0xA80 PFCH_CFG_1 (register value 0x3c_003c)
  • evt_qcnt_th[15:0] = 60
  • pfch_qcnt[15:0] = 60
0xA84 PFCH_CFG_2 (register value 0x8040_03C8)
  • fence = 1
  • rsvd[1:0] = 0
  • var_desc_no_drop = 0
  • pfch_ll_sz_th[15:0] = 1024
  • var_desc_num_pfch[5:0] = 15
  • num_pfch[5:0] = 8
0x147C PFCH_CFG_3 (register value 0x8000)
  • rsvd[4:0] = 0
  • var_desc_fl_free_cnt_th[8:0] = 256
  • var_desc_lg_pkt_cam_cn_th[6:0] = 0
0x1484 PFCH_CFG_4 (register value 0x80_0320)
  • glb_evt_timer_tick[14:0] = 64
  • disable_glb_evt_timer = 0
  • evt_timer_tick[14:0] = 400
  • disable_evt_timer = 0
0x1400 CRDT_COAL_CFG_1 (register value 0x4010)
  • rsvd[12:0] = NA
  • dis_fence_fix = 0
  • pld_fifo_th[7:0] = 16
  • crdt_timer_th[9:0] = 16
0x1404 CRDT_COAL_CFG_2 (register value 0x38_0060)
  • rsv2[7:0] = NA
  • crdt_fifo_th[7:0] = 56
  • rsv1[4:0] = NA
  • crdt_cnt_th[10:0] = 96
0x15C GLBL_RRQ_PCIE_THROT (register value 0x604_5000)
  • req_throt_en = 0
  • req_throt = 192
  • dat_throt_en = 1
  • dat_throt = 20480
0x160 GLBL_RRQ_AXIMM_THROT (register value 0)
  • req_throt_en = 0
  • req_throt = 0
  • dat_throt_en = 0
  • dat_throt = 0
0x158 GLBL_RRQ_BRG_THROT (register value 0x8604_5000)
  • req_throt_en = 1
  • req_throt = 192
  • dat_throt_en = 1
  • dat_throt = 20480
0xE24 H2C_REQ_THROT_PCIE (register value 0x8604_6000)
  • req_throt_en_req = 1
  • req_throt = 192
  • req_throt_en_data = 1
  • data_thresh = 24576
0xE2C H2C_REQ_THROT_AXIMM (register value 0x8204_4000)
  • req_throt_en_req = 1
  • req_throt = 64
  • req_throt_en_data = 1
  • data_thresh = 16384
0x12EC H2C_MM_DATA_THROT (register value 0x1_5000)
  • data_throt_en = 1
  • data_throt = 20480
0x250 QDMA_GLBL_DSC_CFG (register value 0x50_2015)
  • c2h_uodsc_limit (Soft IP) = 5
  • h2c_uodsc_limit (Soft IP) = 8
  • uodsc_limit (KS-B) = NA
  • Max_dsc_fetch = 2
  • wb_acc_int = 5
0x4C CONFIG_BLOCK_MISC_CONTROL (register value 0x1_0009)
  • 10b_tag_en = 0
  • num_tags = 256
  • rq_metering_multiplier = 9
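As a sanity check on Table 1, the register value in each row is simply the field values packed into their bit positions. The sketch below demonstrates this for PFCH_CFG_2; the bit positions are inferred from the field widths listed in the table and are illustrative, not an authoritative register map.

```python
def pack_fields(fields):
    """Pack (value, width) pairs, listed MSB-first, into one 32-bit word."""
    reg = 0
    for value, width in fields:
        reg = (reg << width) | (value & ((1 << width) - 1))
    return reg

# PFCH_CFG_2 (0xA84), fields MSB-first as listed in Table 1
pfch_cfg_2 = pack_fields([
    (1, 1),      # fence
    (0, 2),      # rsvd[1:0]
    (0, 1),      # var_desc_no_drop
    (1024, 16),  # pfch_ll_sz_th[15:0]
    (15, 6),     # var_desc_num_pfch[5:0]
    (8, 6),      # num_pfch[5:0]
])
print(hex(pfch_cfg_2))  # 0x804003c8, matching the table's 0x8040_03C8
```

The same packing reproduces the other rows, for example PFCH_CFG (0xB08) with both 16-bit thresholds at 256 yields 0x100_0100.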
  • QDMA_C2H_INT_TIMER_TICK (0xB0C) set to 25, corresponding to 100 ns (1 tick = 4 ns for a 250 MHz user clock).
  • C2H trigger mode set to user timer, with the counter set to 64 and the timer set to match the round-trip latency. The global timer register should have a value of 30 for 3 μs.
  • TX/RX API burst size = 64, ring depth = 2048. The driver should update the TX/RX PIDX in batches of 64.
  • PCIe MPS = 256 bytes, MRRS >= 512 bytes, Extended Tag enabled, Relaxed Ordering enabled.
  • The driver should update the completion CIDX in batches of 64, before updating the C2H PIDX, to reduce the number of MMIO writes.
  • The driver should update the H2C PIDX in batches of 64, and also update it for the last descriptor of the scatter-gather list.
  • C2H context:
    • bypass = 0 (Internal mode)
    • frcd_en = 1
    • qen = 1
    • wbk_en = 1
    • irq_en = irq_arm = int_aggr = 0
  • C2H prefetch context:
    • pfch = 1
    • bypass = 0
    • valid = 1
  • C2H CMPT context:
    • en_stat_desc = 1
    • en_int = 0 (Poll_mode)
    • int_aggr = 0 (Poll mode)
    • trig_mode = 4
    • counter_idx = corresponding to 64
    • timer_idx = corresponding to 3 μs
    • valid = 1
  • H2C context:
    • bypass = 0 (Internal mode)
    • frcd_en = 0
    • fetch_max = 0
    • qen = 1
    • wbk_en = 1
    • wbi_chk = 1
    • wbi_intvl_en = 1
    • irq_en = 0 (Poll mode)
    • irq_arm = 0 (Poll mode)
    • int_aggr = 0 (Poll mode)
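The timer recommendations above can be cross-checked with simple tick arithmetic, assuming the 250 MHz user clock stated in the recommendations:

```python
# Sanity check of the recommended timer values, assuming a 250 MHz user clock.
USER_CLK_HZ = 250_000_000
clk_tick_ns = 1e9 / USER_CLK_HZ              # 4 ns per user-clock tick

# QDMA_C2H_INT_TIMER_TICK (0xB0C) = 25 -> one timer tick of 100 ns
int_timer_tick_ns = 25 * clk_tick_ns

# Global timer register = 30 timer ticks -> 3 us
global_timer_us = 30 * int_timer_tick_ns / 1000

print(int_timer_tick_ns, global_timer_us)  # 100.0 3.0
```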

For optimal QDMA streaming performance, packet buffers of the descriptor ring should be aligned to at least 256 bytes.
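The 256-byte alignment rule can be expressed as a small helper; `align_up` and `is_aligned` below are hypothetical names for illustration, and the driver's actual buffer allocator will differ.

```python
ALIGN = 256  # minimum recommended alignment for descriptor-ring packet buffers

def align_up(addr, align=ALIGN):
    """Round addr up to the next multiple of align (align is a power of two)."""
    return (addr + align - 1) & ~(align - 1)

def is_aligned(addr, align=ALIGN):
    return addr % align == 0

print(hex(align_up(0x1234)))  # 0x1300
```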

Performance in Descriptor Bypass Mode

When the design is configured in descriptor bypass mode, all the above settings apply. The following recommendations can improve performance in bypass mode.

  • When the h2c_byp_in_st_sdi port is set for every descriptor, the QDMA IP generates a status writeback for every packet. AMD recommends asserting this port only once every 32 or 64 packets, and, if no more descriptors are left, asserting h2c_byp_in_st_sdi for the last descriptor. This requirement applies on a per-queue basis, to AXI4 (H2C and C2H) bypass transfers and to AXI4-Stream H2C transfers.
  • For AXI4-Stream C2H simple bypass mode, the dsc_crdt_in_fence port should be set to 1 for performance reasons. This recommendation assumes the user design has already coalesced credits for each queue and sent them to the IP. In internal mode, set the fence bit in the QDMA_C2H_PFCH_CFG_2 (0xA84) register instead.
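The h2c_byp_in_st_sdi recommendation above amounts to a simple per-queue policy: assert the port every Nth descriptor, or on the last descriptor when none remain. A minimal sketch of that decision logic, with a hypothetical helper name:

```python
SDI_INTERVAL = 64  # the text recommends once every 32 or 64 packets

def assert_sdi(desc_index, is_last_descriptor, interval=SDI_INTERVAL):
    """Return True if h2c_byp_in_st_sdi should be set for this descriptor."""
    if is_last_descriptor:
        # No more descriptors left: request a writeback on the last one.
        return True
    # Otherwise assert only once per interval.
    return (desc_index + 1) % interval == 0

# With 100 descriptors queued, SDI is asserted on descriptors 63 and 99 only.
asserted = [i for i in range(100) if assert_sdi(i, i == 99)]
print(asserted)  # [63, 99]
```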

Performance Optimization Based on Available Cache/Buffer Size

Table 2. QDMA Soft IP
Name Entry/Depth Description
C2H descriptor cache depth 1024 Total number of outstanding C2H stream descriptor fetches in cache bypass and internal modes. This cache depth is not relevant in simple bypass mode, where the user design can maintain a deeper descriptor store.
Prefetch cache depth 64 Number of C2H prefetch tags available. If there are more than 64 active queues with packets < 512 B, performance can degrade depending on the data pattern. If you see performance degradation, you can implement simple bypass mode, where you maintain the descriptor flow yourself.
C2H payload FIFO depth 512 Units of 64 B. The amount of C2H data the C2H engine can buffer. This buffering can sustain a host read latency of up to 2 μs (512 × 4 ns). If the latency is more than 2 μs, there can be performance degradation.
Common reorder buffer depth 512 Units of 64 B for Soft IP. Shared buffer space that can be flexibly allocated between the read engines. Throttle CSRs can be used to limit the amount of outstanding read data used by each engine in this common buffer space.
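The latency-tolerance figure in Table 2 follows from the FIFO geometry. The sketch below assumes the C2H engine drains one 64 B unit per 4 ns user-clock cycle, consistent with the "512 × 4 ns" figure in the table:

```python
ENTRY_BYTES = 64  # each payload FIFO entry holds 64 B
CYCLE_NS = 4      # one 64 B unit drained per 4 ns cycle (assumption)

def latency_tolerance_ns(depth):
    """Host read latency (ns) the FIFO can absorb while draining at full rate."""
    return depth * CYCLE_NS

payload_fifo_bytes = 512 * ENTRY_BYTES  # 32768 B of C2H payload buffering
print(latency_tolerance_ns(512))        # 2048 ns, about 2 us
```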

Resource Utilization

For QDMA resource utilization, see the Resource Use web page.