AI Engine Error Events - 2023.1 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2023-06-23
Version: 2023.1 English

This section describes the errors obtained through the XRT error reporting APIs described previously, along with related debug information. These errors are propagated from the AI Engine array and can be used to debug application-specific errors in hardware.

For errors with class XRT_ERROR_CLASS_AIE, as defined in https://github.com/Xilinx/XRT/blob/master/src/runtime_src/core/include/xrt_error_code.h, you can obtain additional information by enabling the dmesg logs. The logs report the event that caused each error; the events are described in the following tables. An example log is shown here:

[18.462615] aie aie0: Asserted tile error event 56 at col 6 row 7
[18.471397] aie aie0: Asserted tile error event 60 at col 25 row 1
Note: The tile location is indicated by the col and row numbers. Row 0 is the SHIM (interface) tile; AI Engine tiles start from row 1.
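When many such lines have to be triaged, it can help to extract the event number and tile location programmatically. The following is a minimal Python sketch, not part of XRT or the AI Engine driver. The regular expression is derived only from the two sample lines above, so other driver versions may word the message differently. The names `TileErrorEvent`, `parse_tile_error`, and `tile_kind` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class TileErrorEvent(NamedTuple):
    timestamp: float  # seconds since boot, as printed by dmesg
    device: str       # for example "aie0"
    event: int        # error event number (see the tables in this section)
    col: int
    row: int


# Assumption: the message format matches the sample log lines shown above.
_LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+aie\s+(?P<dev>\w+):\s+"
    r"Asserted tile error event (?P<event>\d+) "
    r"at col (?P<col>\d+) row (?P<row>\d+)"
)


def parse_tile_error(line: str) -> Optional[TileErrorEvent]:
    """Return the parsed event, or None if the line has a different format."""
    m = _LINE_RE.search(line)
    if m is None:
        return None
    return TileErrorEvent(
        float(m["ts"]), m["dev"], int(m["event"]), int(m["col"]), int(m["row"])
    )


def tile_kind(row: int) -> str:
    """Row 0 is the SHIM (interface) tile; AI Engine tiles start from row 1."""
    return "SHIM (interface) tile" if row == 0 else "AI Engine tile"


sample_log = """\
[18.462615] aie aie0: Asserted tile error event 56 at col 6 row 7
[18.471397] aie aie0: Asserted tile error event 60 at col 25 row 1
"""

events = [e for e in map(parse_tile_error, sample_log.splitlines()) if e]
for e in events:
    print(f"event {e.event} at col {e.col} row {e.row}: {tile_kind(e.row)}")
```

Lines that do not match the pattern are skipped rather than raising an error, so the sketch can be run over a complete dmesg dump.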

The following tables list the various categories of error, in addition to the exact error number, description, and tips on the next steps to debug and resolve the errors.

Table 1. CORE Module Error Events

| Error Group | No. | Name | Description | Debug Tips |
|---|---|---|---|---|
| Instruction Errors | 59 | Instruction Decompression Error | Event generated when the AI Engine cannot decompress a fetched instruction. This can happen if the program instructions are corrupt. | Validate ELF generation. Regenerate the ELF file with the Vitis compiler (V++) --package command. If the issue persists, contact AMD support. |
| Access Errors | 55 | PM Reg Access Failure | This error can happen on a bank access conflict to PM between the memory mapped AXI interface and the AI Engine. | Contact AMD support. |
| Access Errors | 60 | DM address out of range | Event generated if the AI Engine tries to access a memory location outside of 0x20000 – 0x3FFFF. | Run the AI Engine simulator (aiesimulator) with --enable-memory-check, which flags any access violations. Alternatively, run x86simulator with --valgrind, which also flags any access violations. |
| Access Errors | 65 | PM address out of range | Event generated if the PC is out of range. | Run the AI Engine simulator (aiesimulator) with --enable-memory-check, which flags any access violations. Alternatively, run x86simulator with --valgrind, which also flags any access violations. |
| Access Errors | 66 | DM access to unavailable | Event generated if the AI Engine issues an access to an isolated tile in its neighborhood. | Check whether the kernel running on the AI Engine accesses the data memory of an isolated tile (a tile in a different partition). If the issue persists, contact AMD support. |
| Bus Errors | 58 | AXI MM Slave Error | Event generated if the memory mapped AXI interface slave read/write request is for an address that does not exist in the AI Engine tile. | If the PL IP is accessing the AI Engine registers using the memory mapped AXI interface, check whether the PL IP accesses invalid registers. If the issue persists, contact AMD support. |
| Stream Errors | 54 | TLAST in WSS words 0-2 | Event generated if TLAST is not on the fourth word of a wide stream. | If a PL IP is used to generate the stream, check whether it generates TLAST correctly. If the issue persists, contact AMD support. |
| Stream Errors | 56 | Stream Pkt Parity Error | Event generated if there is a parity error in the packet header. | Check the data source, such as the PL IP that generates the packets, to see whether the packet is valid and the parity bit is correctly calculated. If the data is from a PL IP, check the packet header generated by the PL IP. |
| Stream Errors | 57 | Control Pkt Error | Control packet error. | Check the data source, such as the PL IP that generates the packets, to see whether it generates the packets correctly. If the issue persists, contact AMD support. |
| ECC Errors | 64 | PM ECC Error 2bit | Event generated when a 2-bit ECC error is detected. | Re-run the application. If the issue persists, contact AMD support. |
| ECC Errors | 62 | PM ECC Error Scrub 2bit | Event generated if the ECC scrubber detects a 2-bit ECC error. | Re-run the application. If the issue persists, contact AMD support. |
| Lock Errors | 67 | Lock Access to unavailable | Event generated if the AI Engine issues an access to an isolated tile in its neighborhood. | Run the AI Engine simulator (aiesimulator) with --enable-memory-check, which flags any access violations. Alternatively, run x86simulator with --valgrind, which also flags any access violations. If the issue persists, contact AMD support. |

  1. CORE refers to the AI Engine in the AI Engine tile.
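To go from an event number in the log to its Table 1 entry, a small lookup table is often enough. The sketch below transcribes the numbers, groups, and names from Table 1 into a Python dictionary. The names `CORE_ERROR_EVENTS` and `lookup_core_event` are hypothetical, and the sketch assumes the event is already known to come from the CORE module.

```python
from typing import Optional, Tuple

# (error group, event name) keyed by event number, transcribed from Table 1.
CORE_ERROR_EVENTS = {
    54: ("Stream Errors", "TLAST in WSS words 0-2"),
    55: ("Access Errors", "PM Reg Access Failure"),
    56: ("Stream Errors", "Stream Pkt Parity Error"),
    57: ("Stream Errors", "Control Pkt Error"),
    58: ("Bus Errors", "AXI MM Slave Error"),
    59: ("Instruction Errors", "Instruction Decompression Error"),
    60: ("Access Errors", "DM address out of range"),
    62: ("ECC Errors", "PM ECC Error Scrub 2bit"),
    64: ("ECC Errors", "PM ECC Error 2bit"),
    65: ("Access Errors", "PM address out of range"),
    66: ("Access Errors", "DM access to unavailable"),
    67: ("Lock Errors", "Lock Access to unavailable"),
}


def lookup_core_event(event: int) -> Optional[Tuple[str, str]]:
    """Return (group, name) for a CORE event number, or None if not in Table 1."""
    return CORE_ERROR_EVENTS.get(event)


# Events 56 and 60 appear in the example dmesg log; 61 is not listed in Table 1.
for number in (56, 60, 61):
    entry = lookup_core_event(number)
    if entry is None:
        print(f"event {number}: not listed in Table 1")
    else:
        group, name = entry
        print(f"event {number}: {name} ({group})")
```

Returning None for unlisted numbers keeps the caller responsible for deciding how to report events that this table does not cover.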
Table 2. MEMORY Module Error Events

| Error Group | No. | Name | Description | Debug Tips |
|---|---|---|---|---|
| ECC Errors | 88 | DM ECC Error Scrub 2bit | Event generated when the ECC scrubber detects a 2-bit ECC error in bank 0 or bank 1 of DM. | Re-run the application. If the issue persists, contact AMD support. |
| ECC Errors | 90 | DM ECC Error 2bit | Event generated when a 2-bit ECC error is detected during access to bank 0 or bank 1 of DM. This data memory ECC error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 91 | DM Parity Error Bank 2 | Event generated when a parity error is detected during access to DM bank 2. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 92 | DM Parity Error Bank 3 | Event generated when a parity error is detected during access to DM bank 3. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 93 | DM Parity Error Bank 4 | Event generated when a parity error is detected during access to DM bank 4. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 94 | DM Parity Error Bank 5 | Event generated when a parity error is detected during access to DM bank 5. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 95 | DM Parity Error Bank 6 | Event generated when a parity error is detected during access to DM bank 6. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| Memory Parity Errors | 96 | DM Parity Error Bank 7 | Event generated when a parity error is detected during access to DM bank 7. This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. | Re-run the application. If the issue persists, contact AMD support. |
| DMA Errors | 97 | DMA S2MM 0 Error | This error can be caused by writing to the BD task queue of S2MM channel 0 when it is full. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full. If the issue persists, contact AMD support. |
| DMA Errors | 98 | DMA S2MM 1 Error | This error can be caused by writing to the BD task queue of S2MM channel 1 when it is full. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full. If the issue persists, contact AMD support. |
| DMA Errors | 99 | DMA MM2S 0 Error | This error can be caused by writing to the BD task queue of MM2S channel 0 when it is full. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full. If the issue persists, contact AMD support. |
| DMA Errors | 100 | DMA MM2S 1 Error | This error can be caused by writing to the BD task queue of MM2S channel 1 when it is full. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full. If the issue persists, contact AMD support. |
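The MEMORY module event numbers in Table 2 follow a regular pattern: events 91 to 96 map to DM banks 2 to 7 (bank = event number - 89), and events 97 to 100 map to the two S2MM channels followed by the two MM2S channels. The sketch below uses that arithmetic to name an event. The function name `decode_memory_event` is hypothetical, and the pattern is inferred only from the rows of Table 2, not from a hardware specification.

```python
from typing import Optional


def decode_memory_event(event: int) -> Optional[str]:
    """Return the Table 2 name for a MEMORY module event, or None if unlisted."""
    if event == 88:
        return "DM ECC Error Scrub 2bit"
    if event == 90:
        return "DM ECC Error 2bit"
    if 91 <= event <= 96:
        # Events 91..96 correspond to DM banks 2..7.
        return f"DM Parity Error Bank {event - 89}"
    if 97 <= event <= 100:
        # 97, 98 are S2MM channels 0, 1; 99, 100 are MM2S channels 0, 1.
        direction = "S2MM" if event < 99 else "MM2S"
        channel = (event - 97) % 2
        return f"DMA {direction} {channel} Error"
    return None


for number in (88, 93, 98, 100, 89):
    print(number, decode_memory_event(number))
```

Event 89 is not listed in Table 2, so the sketch returns None for it instead of guessing a name.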

Table 3. SHIM Module Error Events

| Error Group | No. | Name | Description | Debug Tips |
|---|---|---|---|---|
| Bus Errors | 62 | AXI MM Slave Tile Error | Event generated if a memory mapped AXI interface slave request comes to an interface tile but the address is invalid. | If using the PL IP to access the AI Engine registers with the memory mapped AXI interface, check whether the IP tries to access the wrong address. If the issue persists, contact AMD support. |
| Bus Errors | 64 | AXI MM Decode NSU Error | The memory mapped AXI interface traffic internally has responded with a DECERR. For example, if a column or a set of tiles is clock gated, a decode error is generated internally and travels on the memory mapped AXI interface to the interface tile to generate this event. | If using the PL IP to access the AI Engine registers with the memory mapped AXI interface, check whether the IP tries to access a tile that is gated. If the issue persists, contact AMD support. |
| Bus Errors | 65 | AXI MM Slave NSU Error | The memory mapped AXI interface traffic internally has responded with a SLVERR. For example, an AI Engine tile in that interface tile column has responded with a slave error. That slave error travels over the memory mapped AXI interface to the interface tile as a slave error. | If using the PL IP to access the AI Engine registers with the memory mapped AXI interface, check whether the IP tries to access the wrong address. If the issue persists, contact AMD support. |
| Bus Errors | 66 | AXI MM Unsupported Traffic | The memory mapped AXI interface from the NoC has made a request that the AI Engine does not support. | If using the PL IP to access the AI Engine registers with the memory mapped AXI interface, check whether the IP generates unsupported memory mapped AXI interface requests. |
| Bus Errors | 67 | AXI MM Unsecure Access in Secure Mode | The memory mapped AXI interface from the NoC is violating the secure mode (trying to route unsecured traffic when the AI Engine only supports secure traffic). | Check whether the AI Engine array is configured in secure mode. |
| Bus Errors | 68 | AXI MM Byte Strobe Error | The memory mapped AXI interface from the NoC is writing with non-complete 32-bit words (within a 32-bit word, all byte strobes must be set). | If the PL IP is accessing the AI Engine using the memory mapped AXI interface, check whether all byte strobes are set for a 32-bit word. |
| Stream Error | 63 | Control Pkt Error | Control packet error. | If the PL IP is generating the control packets, check whether the IP generates the packets properly. If the issue persists, contact AMD support. |
| DMA Error | 69 | DMA S2MM 0 Error | This DMA error is for DMA S2MM channel 0. It can be caused by writing to the BD task queue when it is full, by a decode error when the DMA tries to access the memory, or by a slave error when the DMA tries to access the memory. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check whether the memory address sent to the interface tile DMA buffer descriptor is invalid. If the issue persists, contact AMD support. |
| DMA Error | 70 | DMA S2MM 1 Error | This DMA error is for DMA S2MM channel 1. It can be caused by writing to the BD task queue when it is full, by a decode error when the DMA tries to access the memory, or by a slave error when the DMA tries to access the memory. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check whether the memory address sent to the interface tile DMA buffer descriptor is invalid. If the issue persists, contact AMD support. |
| DMA Error | 71 | DMA MM2S 0 Error | This DMA error is for DMA MM2S channel 0. It can be caused by writing to the BD task queue when it is full, by a decode error when the DMA tries to access the memory, or by a slave error when the DMA tries to access the memory. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check whether the memory address sent to the interface tile DMA buffer descriptor is invalid. If the issue persists, contact AMD support. |
| DMA Error | 72 | DMA MM2S 1 Error | This DMA error is for DMA MM2S channel 1. It can be caused by writing to the BD task queue when it is full, by a decode error when the DMA tries to access the memory, or by a slave error when the DMA tries to access the memory. | If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check whether the memory address sent to the interface tile DMA buffer descriptor is invalid. If the issue persists, contact AMD support. |

  1. SHIM refers to the interface tiles in the AI Engine array.
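The byte strobe rule behind event 68 (within a 32-bit word, all byte strobes must be set) can be checked in a PL IP testbench or a trace post-processing script. The following Python sketch models a write strobe as an integer with one bit per byte lane and reports which 32-bit words are only partially strobed. The function name `partial_word_indices` and the `bus_bytes` parameter are hypothetical. Treating a word with no strobes set as "not written", and therefore legal, is an assumption that the table does not confirm.

```python
from typing import List


def partial_word_indices(wstrb: int, bus_bytes: int) -> List[int]:
    """Return indices of 32-bit words whose byte strobes are only partly set.

    wstrb:     write strobe value, bit i set means byte lane i is written.
    bus_bytes: data bus width in bytes; must be a positive multiple of 4.
    """
    if bus_bytes <= 0 or bus_bytes % 4 != 0:
        raise ValueError("bus_bytes must be a positive multiple of 4")
    partial = []
    for word in range(bus_bytes // 4):
        strobes = (wstrb >> (4 * word)) & 0xF
        # Assumption: 0x0 means the word is not written, 0xF means a complete
        # 32-bit write; any other pattern is a non-complete word.
        if strobes not in (0x0, 0xF):
            partial.append(word)
    return partial


print(partial_word_indices(0xF, 4))      # complete write on a 32-bit bus
print(partial_word_indices(0x3, 4))      # lower half-word only
print(partial_word_indices(0x1F00, 16))  # 128-bit bus, word 3 is partial
```

An empty result means every written word is complete under the stated assumption; a non-empty result lists the word positions to inspect in the PL IP.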