Error Reporting and Monitoring - 1.0 English

Versal Adaptive SoC Programmable Network on Chip and Integrated Memory Controller 1.0 LogiCORE IP Product Guide (PG313)

Document ID: PG313
Release Date: 2023-11-01
Version: 1.0 English

The HBM controller provides error logging to help maintain the integrity and reliability of the memory subsystem. Errors are reported for each pseudo channel by the NoC Agent units through the corresponding REG_ISR registers in the HBM MC_NA register modules, as described in the Versal Adaptive SoC NoC and Integrated Memory Controller NPI Register Reference (AM019). There are two NoC Agents per controller (one for each HBM pseudo channel) and eight controllers per stack, for a total of 16 HBMMC_NA register modules per stack. REG_ISR is a sticky register: once a field is set, it remains set until the PMC writes a 1 to that field.
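The sticky, write-1-to-clear behavior described above can be illustrated with a small software model. This is only a sketch: the class name, the 32-bit width, and the `latch`/`read`/`write` methods are hypothetical and do not correspond to any real driver API. Actual register access goes through the NPI interface described in AM019.

```python
class StickyW1CRegister:
    """Toy model of a sticky, write-1-to-clear status register (hypothetical)."""

    def __init__(self, width=32):  # width is an assumption for illustration
        self.mask = (1 << width) - 1
        self.value = 0

    def latch(self, bit):
        """Hardware side: an error event sets its bit, which then stays set."""
        self.value |= (1 << bit) & self.mask

    def read(self):
        """Reading does not clear anything; the bits are sticky."""
        return self.value

    def write(self, data):
        """Software side: each 1 written clears that bit; 0s leave bits alone."""
        self.value &= ~data & self.mask


# Register module count per stack, from the figures given in the text.
NA_PER_CONTROLLER = 2        # one NoC Agent per HBM pseudo channel
CONTROLLERS_PER_STACK = 8
MODULES_PER_STACK = NA_PER_CONTROLLER * CONTROLLERS_PER_STACK

isr = StickyW1CRegister()
isr.latch(3)                 # e.g. a single-bit ECC error (bit 3)
isr.latch(4)                 # e.g. a double-bit ECC error (bit 4)
pending = isr.read()         # both bits remain set until cleared
isr.write(1 << 3)            # clear only bit 3; bit 4 stays pending
```

Writing a mask with a single 1 clears just that field, so software can acknowledge one error without losing others that were latched in the meantime.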

The following table shows the different errors reported by the controller through the REG_ISR register.

Table 1. Error Logging in REG_ISR Register
| Field Name | Bit | Error Condition | Description |
|---|---|---|---|
| EHP0 | 0 | 1st Read Data Parity Error from HBM Memory | First parity error encountered on a read from HBM memory. If read retry is disabled, the first parity error is reported by EHP1, in which case EHP0 never triggers. |
| EHP1 | 1 | 2nd Read Data Parity Error from HBM Memory | A first read parity error was encountered and the read retry also produced a parity error. If read retry is disabled, EHP1 is raised for the first parity error itself. |
| EHP2 | 2 | Read Data Parity Error from Data Buffer | Parity error encountered on data pulled from the read data buffer while sending RRESP. |
| EHP3 | 3 | Single-bit Read Data ECC Error from HBM | Single-bit correctable error detected on read data from the HBM memory. |
| EHP4 | 4 | Double-bit Read Data ECC Error from HBM | Double-bit uncorrectable error detected on read data from the HBM memory. |
| EHP5 | 5 | 1st Write Data Parity Error from HBM | First parity error detected on data written to memory. If write retry is disabled, EHP6 triggers instead. |
| EHP6 | 6 | 2nd Write Data Parity Error from HBM | The first write data parity error was retried and the write retry also produced a parity error. If write retry is disabled, the first parity error produces EHP6 instead of EHP5. |
| EHP7 | 7 | Write Data Parity Error from Data Buffer | Parity error found on data pulled from the write buffer to write to the HBM memory. |
| EHP8 | 8 | NoC Data Poison | NoC data flit arrived with the data poison bit set. The NoC switch should never send out a flit with data poison set. If it does, this error is raised. |
| EHP9 | 9 | NoC Command Poison | NoC data flit arrived with the command poison bit set. The NoC switch should never send out a flit with the command poison bit set. If it does, this error is raised. |
| EHP10 | 10 | Read Data Parity Error at NoC asynchronous FIFO | Parity error found on data pulled from the asynchronous read response FIFO when sending RRESP over the NoC. A SLVERR is forced on the outgoing response flit. |
| EHP11 | 11 | Write Data Parity Error at NoC asynchronous FIFO | Parity error found on data pulled from the asynchronous write data FIFO to forward to the write buffer. If HBM ECC is enabled, a parity error is injected in the data buffer and the data path then injects a 2-bit ECC error when writing the data to memory. If ECC is disabled, a parity error is injected in the data buffer and the data path then sets all the data masks to disable the write. |
| EHP12 | 12 | Write Data Flit ECC Error | A write data flit was received with an uncorrectable ECC error. If HBM ECC is enabled, a parity error is injected in the data buffer and the data path then injects a 2-bit ECC error when writing the data to memory. If ECC is disabled, a parity error is injected in the data buffer and the data path then sets all the data masks to disable the write. |
| EHP13 | 13 | Header flit with AxLen greater than 15 or write data flit count does not match expected AxLen | Raised if the AXI length on the header flit is greater than 15 or if, in a write transaction, the number of write data flits does not match the AXI length. The HBM controller goes into an undefined state if either occurs. |
| EHP14 | 14 | NPP flit received on an unmapped virtual channel | Triggered when a NoC flit is received on an unconfigured VC. The HBM controller goes into an undefined state. |
| EHP15 | 15 | Wrap Transaction with Invalid Burst Length | Raised when a wrap transaction is received with an invalid AxLen. The transaction is dropped. The HBM controller goes into an undefined state. |
| EHP16 | 16 | Parity Error on an Egress flit | Parity error in either the BRESP or the RRESP control signals. If BRESP has the error, the WRITE bit is set in the log register. If RRESP has the error, the READ bit is set in the log register. The HBM controller goes into an undefined state. |
| EHP17 | 17 | Ingress Credit Overflow | Raised when a flit is received although zero credits are available on the ingress path. The HBM controller goes into an undefined state. |
| EHP18 | 18 | Destination ID Check | The destination ID received with the flit does not match the destination ID of the NSU port. The HBM controller goes into an undefined state. |
| EHP19 | 19 | Received Credit Ready with Ready De-asserted | Raised when the HBM controller's NSU egress port receives a credit return when it is not ready to receive (detected during initialization of the controller). The HBM controller does not take any action. |
| EHP20 | 20 | Uncorrectable ECC Error on Header flit | Uncorrectable ECC error found on a header flit. The command and data are dropped in the NPP and no response is sent. |
| EHP21 | 21 | Command Parity Error | Parity error detected on command FIFO data. The controller goes into an undefined state. |
| EHP22 | 22 | XMPU Violation on a Transaction | XMPU access violation in the transaction. |
| EHP23 | 23 | HBM Command Parity Error | The HBM memory detects a parity error on a command. AERR is raised by the HBM memory. |
| EHP33 | 27 | Correctable ECC Error on an Incoming Flit | Correctable ECC error found on an incoming flit. |
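
As an illustration of how software might interpret a REG_ISR value read from one register module, the following sketch maps set bits to the field names in Table 1 and flags the errors that the table says leave the controller in an undefined state. The function and constant names are hypothetical, and the bit-to-name mapping is taken only from the table above (including EHP33 at bit 27). Consult AM019 for the authoritative register layout.

```python
# Bit position -> field name, as listed in Table 1.
ISR_FIELDS = {bit: f"EHP{bit}" for bit in range(24)}
ISR_FIELDS[27] = "EHP33"

# Bits whose descriptions state that the controller goes into an undefined state.
UNDEFINED_STATE_BITS = {13, 14, 15, 16, 17, 18, 21}


def decode_isr(value):
    """Return the names of all Table 1 fields set in a REG_ISR value."""
    return [ISR_FIELDS[bit] for bit in sorted(ISR_FIELDS) if (value >> bit) & 1]


def leaves_undefined_state(value):
    """True if any set bit is one that Table 1 ties to an undefined state."""
    return any((value >> bit) & 1 for bit in UNDEFINED_STATE_BITS)


def clear_mask(names):
    """Build a write-1-to-clear mask for the given field names."""
    by_name = {name: bit for bit, name in ISR_FIELDS.items()}
    mask = 0
    for name in names:
        mask |= 1 << by_name[name]
    return mask


status = (1 << 3) | (1 << 13)          # example value, not real hardware data
fields = decode_isr(status)            # names of the two set fields
fatal = leaves_undefined_state(status)
```

A handler built this way could log the decoded names, write `clear_mask(fields)` back to acknowledge them, and escalate when `leaves_undefined_state` returns true.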