To access and dump the hardware debug counters for a Mellanox RNIC, use the following commands:
$ cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
$ for file in *; do echo -n "${file}:"; cat "$file"; done
Note: For more information on hardware debug counters for Mellanox, see DOC-2572 on the Mellanox community.
Some quick debug checks that you can do to ensure that the system is clean are listed here.
• Check the ethtool.log file for any link failures or CRC failures. A non-zero value in either counter (link_down_events_phy and rx_crc_errors_phy in the snippets here) points towards an unstable link, which can be the cause of failure. Two snippets from the ethtool.log file are listed here.
Sample 1:
rx_wqe_err: 0
rx_mpwqe_filler: 0
rx_mpwqe_frag: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
link_down_events_phy: 4
rx_out_of_buffer: 0
rx_vport_unicast_packets: 5
rx_vport_unicast_bytes: 422
tx_vport_unicast_packets: 10
tx_vport_unicast_bytes: 714
rx_vport_multicast_packets: 46
rx_vport_multicast_bytes: 7608
Sample 2:
rx_vport_rdma_multicast_bytes: 0
tx_vport_rdma_multicast_packets: 0
tx_vport_rdma_multicast_bytes: 0
tx_packets_phy: 320587336
rx_packets_phy: 324924215
rx_crc_errors_phy: 0
tx_bytes_phy: 27657272230
rx_bytes_phy: 467673988652
tx_multicast_phy: 59
tx_broadcast_phy: 32
rx_multicast_phy: 0
rx_broadcast_phy: 9
rx_in_range_len_errors_phy: 0
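The link check described above can be scripted. The following is a minimal sketch, assuming the log holds one "name: value" pair per line as in the two samples; the function name check_link_counters is hypothetical, and the choice of link_down_events_phy and rx_crc_errors_phy as the two counters to inspect is taken from the samples rather than from a documented rule.

```shell
# Report the link-failure and CRC counters from an ethtool-style log.
# Assumption: every line has the form "<name>: <value>" (leading spaces allowed).
# Returns 1 if either counter is non-zero, 0 otherwise.
check_link_counters() {
    local log_file="$1"
    awk -F': *' '
        BEGIN { bad = 0 }
        { gsub(/^[ \t]+/, "", $1) }          # drop indentation before the name
        $1 == "link_down_events_phy" || $1 == "rx_crc_errors_phy" {
            status = ($2 + 0 == 0) ? "OK" : "CHECK LINK"
            printf "%s=%s %s\n", $1, $2, status
            if ($2 + 0 != 0) bad = 1
        }
        END { exit bad }
    ' "$log_file"
}

# Example: check_link_counters ethtool.log
```

Run against Sample 1, this would flag link_down_events_phy because its value is 4.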
• Check the hw_counters on the initiator side. These counters give a picture of all fatal/non-fatal errors seen by the initiator. A sample of the counters:
duplicate_request:0
implied_nak_seq_err:0
lifespan:10
local_ack_timeout_err:0
out_of_buffer:0
out_of_sequence:0
packet_seq_err:0
rnr_nak_retry_err:0
rx_atomic_requests:0
rx_read_requests:5
rx_write_requests:108307999
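To avoid reading the full dump by eye, the error-type counters can be filtered for non-zero values. This is a sketch only: the function name report_nonzero_hw_counters is hypothetical, and the list of counters treated as error indicators is an assumption based on the names in the sample above (traffic counters such as rx_write_requests and the lifespan value are skipped).

```shell
# Print every error-type hw_counter whose value is non-zero, as "name:value".
# Assumption: the names in the loop below are the error indicators of interest.
# Returns 1 if any listed counter is non-zero, 0 otherwise.
report_nonzero_hw_counters() {
    local dir="${1:-/sys/class/infiniband/mlx5_0/ports/1/hw_counters}"
    local name value found=0
    for name in duplicate_request implied_nak_seq_err local_ack_timeout_err \
                out_of_buffer out_of_sequence packet_seq_err rnr_nak_retry_err; do
        [ -r "${dir}/${name}" ] || continue   # skip counters this device lacks
        value=$(cat "${dir}/${name}")
        if [ "$value" -ne 0 ]; then
            echo "${name}:${value}"
            found=1
        fi
    done
    return "$found"
}

# Example: report_nonzero_hw_counters /sys/class/infiniband/mlx5_0/ports/1/hw_counters
```

For the sample shown, no line would be printed because all of the listed counters are 0.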
• Check the following register locations from the ERNIC register dump. The QP Status (STATQPi) registers report the status of each enabled QP. Check whether the QP FATAL status is set to 1 in any of the QP status registers. For example, the QP Status register for QP 5:
0x84020688: 30620601 QP Fatal is set to 1
• If the QP is in FATAL state, no transactions are performed from this QP and the QP gets disconnected. Bits[22:16] in the same register provide the last AETH syndrome received from the initiator. In many cases, the QP goes into FATAL state because of a NAK syndrome received from the initiator. The NAK syndrome helps you understand the failure seen by the initiator RNIC card. In the above example, the AETH syndrome of 0x62 indicates a “Remote Access Error” from the initiator. The decoding of the AETH syndrome is provided in the InfiniBand Architecture Specification Volume 1 (Release 1.2.1). For NAK code details in this specification, see Table 43: AETH Syndrome field and Table 44: NAK Codes.
• Check the incoming and outgoing NAK count registers (INNACKPKTCNT and OUTNACKPKTCNT) at offsets 0x134 and 0x138 for the number of incoming NAK syndromes seen and the number of NAK syndromes sent out. These numbers should normally correlate with the hw_counters values seen at the initiator. In general, not all NAK codes are fatal. However, all NAK codes lead to retries and can lower the overall performance of the system, so a high number of NAK codes can be a cause for concern.
• The total number of retries initiated by the target is available in the Retry count status register (RETRYCNTSTS) at offset 0x140. Normally this number matches the incoming NAK count. If it is higher than the incoming NAK count, the difference might be due to timeouts. Timeouts happen when the responder (in this case, the initiator RNIC) does not respond to a request within a given time. The timeout value is configured in the Timeout Configuration register (TIMEOUTCONF). This timeout interval is implemented as per the InfiniBand™ Architecture Specification Volume 1 (Release 1.2.1) clause C9-141. It might be worthwhile to increase the timeout interval and check whether the number of retries is reduced.
• ERNIC register offset 0x6C (ERRBUFWPTR) holds the error buffer write pointer. This register gives the number of error packets received. Each error packet is stored at the address location provided in the Error buffer base address (ERRBUFBA) register (offset 0x60). Each entry in this buffer includes an error syndrome. See ERNIC RX Path for details. The rows highlighted in yellow list the conditions that cause the QP to go into FATAL state.
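The register checks above involve two small calculations that can be sketched in shell: extracting and decoding the AETH syndrome from a QP Status value, and comparing the retry count against the incoming NAK count. The function names are hypothetical, and how the register values are read from the device is not shown. The syndrome layout used here (bits [6:5] select ACK, RNR NAK, or NAK; bits [4:0] carry the NAK code) follows the AETH syndrome and NAK code tables of the InfiniBand specification cited above.

```shell
# Extract bits [22:16] (last AETH syndrome) from a STATQPi value.
# Accepts the hex value with or without a leading "0x".
aeth_syndrome() {
    local reg=$(( 0x${1#0x} ))
    printf '0x%02X\n' $(( (reg >> 16) & 0x7F ))
}

# Decode an AETH syndrome into its type and, for a NAK, its NAK code.
decode_aeth() {
    local syn=$(( $1 ))
    local kind=$(( (syn >> 5) & 0x3 ))
    local low=$(( syn & 0x1F ))
    case "$kind" in
        0) echo "ACK" ;;
        1) echo "RNR NAK" ;;
        2) echo "Reserved" ;;
        3) case "$low" in
               0) echo "NAK: PSN Sequence Error" ;;
               1) echo "NAK: Invalid Request" ;;
               2) echo "NAK: Remote Access Error" ;;
               3) echo "NAK: Remote Operational Error" ;;
               4) echo "NAK: Invalid RD Request" ;;
               *) echo "NAK: reserved code ${low}" ;;
           esac ;;
    esac
}

# Retries in excess of the incoming NAK count; these might be due to timeouts.
# Arguments: RETRYCNTSTS value, INNACKPKTCNT value.
retries_beyond_naks() {
    local retries=$(( $1 ))
    local naks=$(( $2 ))
    if [ "$retries" -gt "$naks" ]; then
        echo $(( retries - naks ))
    else
        echo 0
    fi
}
```

For the QP 5 example, decode_aeth "$(aeth_syndrome 30620601)" reports "NAK: Remote Access Error", which agrees with the 0x62 syndrome discussed above.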