Performance bottlenecks can occur in many places. Most issues fall into one of four areas to debug:
- Masters
- The device that is initiating the transfers.
- Check its capability by calculating the theoretical bandwidth of that master: AXI4 data bus width × AXI4 clock frequency (see the bandwidth sketch after this list).
- If CPM is mastering:
- Check whether it is using both AXI4-MM ports or just a single AXI4-MM port, and calculate the aggregated bandwidth through these ports.
- Check the CPMTOPSWCLK frequency. The CPM AXI4-MM ports are fixed at 128 bits wide.
- Check the traffic pattern. Make sure packets are split across transfer IDs (these can be Queue IDs or Channel IDs). Do not split a packet using the same AXI4 ID.
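As a quick sanity check, the width × frequency calculation above can be scripted. The following is a minimal Python sketch; the clock frequencies used are placeholder assumptions, not values from this document, so substitute your actual AXI clock and CPMTOPSWCLK frequencies.

```python
# Minimal sketch: theoretical AXI bandwidth of a master.
# Clock frequencies below are placeholders -- substitute the actual
# AXI clock / CPMTOPSWCLK frequencies from your design.

def axi_bandwidth_gbps(data_width_bits: int, clock_hz: float) -> float:
    """Theoretical throughput in GB/s: data bus width (bytes) * clock frequency."""
    return (data_width_bits / 8) * clock_hz / 1e9

# Generic master example: 256-bit AXI4 bus at 250 MHz (assumed values).
print(axi_bandwidth_gbps(256, 250e6))          # -> 8.0 GB/s

# CPM mastering: each AXI4-MM port is fixed at 128 bits wide.
cpm_topsw_clk_hz = 1e9                          # assumption -- use your CPMTOPSWCLK
single_port = axi_bandwidth_gbps(128, cpm_topsw_clk_hz)
aggregated = 2 * single_port                    # both AXI4-MM ports in use
print(single_port, aggregated)
```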
- Slaves
- The device that is receiving the transfers.
- Check its capability by calculating the theoretical bandwidth of that slave: AXI4 data bus width × AXI4 clock frequency.
- If CPM is receiving:
- CPM has only one AXI4-MM slave port. Calculate the bandwidth through this port.
- Check the CPMTOPSWCLK frequency. The CPM AXI4-MM slave port is fixed at 128 bits wide.
- Check the traffic pattern. The slave port is used both to access internal registers (which internally use an AXI4-Lite interface limited to one outstanding AXI4 transaction) and to bus-master reads/writes to the PCIe link. Do not interleave these two destinations, to avoid Head-of-Line blocking (see the sketch after this list).
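To see why mixing register accesses into the bus-master stream hurts, the following minimal sketch estimates the throughput of a path that allows only one outstanding transaction; the access size and round-trip latency are illustrative assumptions.

```python
# Minimal sketch: why interleaving register accesses with bus-master traffic hurts.
# With only one outstanding AXI4-Lite transaction, the register path is
# latency-bound; the numbers below are illustrative assumptions.

def single_outstanding_bw_mbps(beat_bytes: int, round_trip_latency_s: float) -> float:
    """Effective throughput (MB/s) when only one transaction can be in flight."""
    return beat_bytes / round_trip_latency_s / 1e6

# Example: 4-byte register accesses with a 1 us round-trip latency (assumed).
print(single_outstanding_bw_mbps(4, 1e-6))   # ~4 MB/s -- far below the DMA path

# Every register access interleaved into the DMA stream stalls the packets
# queued behind it for roughly one full round trip (Head-of-Line blocking).
```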
- Interconnects
- All interconnects and switches the packet has to go through.
- Analyze all interconnects in the transfer path (NoC, CCI-500, SmartConnect, and so on). Check their throughput capability by calculating the AXI4 data bus width × AXI4 clock frequency of each interconnect.
- Check the AXI4 outstanding-transactions settings. The higher your system's latency or the smaller your AXI4 packets, the higher this number needs to be to avoid credit starvation (see the sketch after this list).
- Check the traffic pattern. Ensure that "slow" and "fast" datapaths are not interleaved, to avoid Head-of-Line blocking.
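The outstanding-transactions guidance above is essentially a bandwidth-delay-product calculation. The following minimal sketch shows one way to estimate the setting; the target bandwidth, latency, and burst size are illustrative assumptions.

```python
import math

# Minimal sketch: estimating the AXI4 outstanding-transaction setting needed to
# keep a link busy. All numbers are illustrative assumptions.

def outstanding_needed(target_bw_bytes_per_s: float,
                       round_trip_latency_s: float,
                       bytes_per_transaction: int) -> int:
    """Bandwidth-delay product divided by the transaction size, rounded up."""
    in_flight_bytes = target_bw_bytes_per_s * round_trip_latency_s
    return math.ceil(in_flight_bytes / bytes_per_transaction)

# Example: sustaining 8 GB/s over a 1 us round trip with 256-byte bursts (assumed).
print(outstanding_needed(8e9, 1e-6, 256))   # -> 32 outstanding transactions
```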
- External/Software
- Factors that are outside of the Xilinx device.
- Software/Driver/Apps
- Software and drivers are typically much slower than hardware. To maximize throughput, you must ensure that there is minimal "maintenance" required from software during transfers:
- Maximize available Descriptor Queue ring size or Descriptor Chain size.
- Maximize the Transfer Size, including the Max Payload Size and Max Read Request Size settings at the host (see the sketch after this list).
- Maximize number of Queues and DMA Channels.
- Minimize Interrupts and avoid excessive use of Poll Mode.
- Minimize pointer and other hardware register updates.
- Avoid Bus Mastering from software. Let hardware do DMA or Bus Mastering.
- Avoid excessive copies between user-space and kernel-space memory. Pin memory at the host to be used for transfers.
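To illustrate why larger transfer sizes help, the following minimal sketch estimates PCIe link efficiency as a function of Max Payload Size. The per-TLP overhead constant is an assumed typical value; the exact figure depends on the PCIe generation and options such as ECRC.

```python
# Minimal sketch: how Max Payload Size affects PCIe link efficiency.
# The per-TLP overhead (header + framing) is an assumed typical value.

TLP_OVERHEAD_BYTES = 24   # assumption: roughly 20-28 bytes per TLP

def pcie_payload_efficiency(max_payload_bytes: int) -> float:
    """Fraction of link bytes carrying payload rather than TLP overhead."""
    return max_payload_bytes / (max_payload_bytes + TLP_OVERHEAD_BYTES)

for mps in (128, 256, 512):
    print(mps, f"{pcie_payload_efficiency(mps):.1%}")
# 128 -> ~84%, 256 -> ~91%, 512 -> ~96%: larger payloads waste less of the link
# and require fewer descriptor/pointer updates per byte transferred.
```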
- Switches/IOMMU/Processor Links
- When transactions are going in and out of the host, there are various common modules along the path that can increase latency and therefore reduce your overall throughput:
- Pick a path with the fewest PCIe switches. Analyze the available PCIe slots and their bus topology. Ensure the software or driver is running on the CPU attached to that PCIe bus. Use memory devices (disks, DDR memory, etc.) that are directly attached to that CPU.
- Pick CPUs and PCIe switches that have high numbers of PCIe credits or that support Extended PCIe Tags. This greatly increases the number of outstanding PCIe packets, which is required to compensate for the high latency at the host. Enterprise-grade systems typically advertise higher values than desktop or workstation systems.
- Ensure the active PCIe slot's and PCIe switch's link width and speed match your device. Verify that the PCIe link is trained to the optimum link speed and width (see the sketch after this list).
- An IOMMU is often required for multifunction devices; however, it adds latency for translating PCIe addresses. Avoid relying on the IOMMU unless it is required.
- Disable low-power states on the PCIe links and CPUs. While these features save power and can be significant in a large data center environment, repeatedly entering and exiting these power states can slow down transfers and increase latency.
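On Linux, the negotiated link speed and width can be read from sysfs to confirm the link trained as expected. The following minimal sketch assumes a Linux host; the bus/device/function string is a placeholder.

```python
# Minimal sketch (Linux): read the negotiated PCIe link speed/width from sysfs
# and compare against the device's maximum. The BDF below is a placeholder.

from pathlib import Path

def read_link_status(bdf: str) -> dict:
    dev = Path("/sys/bus/pci/devices") / bdf
    return {attr: (dev / attr).read_text().strip()
            for attr in ("current_link_speed", "current_link_width",
                         "max_link_speed", "max_link_width")}

status = read_link_status("0000:3b:00.0")   # placeholder bus/device/function
print(status)
# If the current_* values are below max_*, the link trained down -- re-check the
# slot, the switch, and link training before chasing other bottlenecks.
```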