AI Engine Error Reporting - 2025.2 English - UG1076

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2025-11-20
Version
2025.2 English

XRT provides error reporting APIs and tools. There are two types of errors, as follows:

Synchronous error
Errors detectable during the XRT runtime function call.
Asynchronous error
Errors from the underlying driver, system, hardware, and so on.
Following is a synchronous error handling example:
auto ghdl = xrt::graph(device, uuid, "gr");
try {
  ghdl.update("gr.fir24.in[1]", narrow_filter);
  ghdl.run(16);
  ghdl.read("gr.fir24.inout[0]", coeffs_readback); // Async read
} catch (std::exception const& e) {
  // Report the reason carried by the exception
  std::cout << "Graph Execution Error: " << e.what() << std::endl;
  return 1;
}

An asynchronous error can occur independently of the current XRT function call or the running application. Asynchronous errors are cached in driver subsystems and can be accessed by the user application through the asynchronous error reporting APIs. Cached errors are persistent until explicitly cleared.

Persistent errors are not necessarily indicative of the current system state. For example, a reset board can function correctly while previously cached errors are still available. To avoid current state confusion, asynchronous errors have a timestamp attached indicating when the error occurred. The timestamp can be compared to, for example, the timestamp for the last xrt-smi reset.
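
The timestamp comparison described above can be sketched as follows. This is an illustration only: `is_stale_error` is a hypothetical helper name, and the sketch assumes that the application has already obtained the last reset time by some means and that both values use the same clock and unit.

```cpp
#include <cstdint>

// Hypothetical helper: returns true when a cached asynchronous error was
// recorded before the most recent reset, meaning it may not describe the
// current system state.
// Assumption: both timestamps use the same clock and the same unit.
bool is_stale_error(std::uint64_t error_timestamp,
                    std::uint64_t last_reset_timestamp)
{
  return error_timestamp < last_reset_timestamp;
}
```

An error whose timestamp equals or follows the reset time is treated as current in this sketch.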

The errors cached by the driver contain a system error code and additional meta data, as defined in the xrt_error_code.h file in the XRT Repository. The user space and the kernel space share this information.

For the XRT error handling APIs, refer to experimental/xrt_error.h. Following is an asynchronous error handling example:

xrt::error error(device, XRT_ERROR_CLASS_AIE);
auto errCode = error.get_error_code();
auto timestamp = error.get_timestamp();
auto err_str = error.to_string();
/* code to deal with this specific error */
std::cout<<"Async error: "<< err_str << std::endl;

Following is an example asynchronous error output:


Error Number (6): AIE_ACCESS
Error Driver (4): DRIVER_AIE
Error Severity (3): SEVERITY_CRITICAL
Error Module (3): MODULE_AIE_CORE
Error Class (2): CLASS_AIE
Timestamp: 1637342412366664740

XRT maintains the latest error for each class and an associated timestamp. You can use the xrt_error_code.h file in the XRT Repository to interpret error information. For example, Error Module (3): MODULE_AIE_CORE corresponds to XRT_ERROR_MODULE_AIE_CORE in the xrtErrorModule enumeration.
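
When the error string has to be processed by a program rather than read by a person, each line of the example output can be split into its field, numeric value, and name. The following is a minimal sketch, assuming the string keeps the line format shown in the example output above; `ErrorField` and `parse_error_field` are hypothetical names, not part of XRT.

```cpp
#include <optional>
#include <regex>
#include <string>

// One "Error <Field> (<number>): <NAME>" line from the error output.
struct ErrorField {
  std::string field;  // for example "Module"
  int value;          // for example 3
  std::string name;   // for example "MODULE_AIE_CORE"
};

// Hypothetical helper: parses a single output line. Returns std::nullopt
// for lines that do not match, such as the "Timestamp:" line.
std::optional<ErrorField> parse_error_field(const std::string& line)
{
  static const std::regex pattern(R"(^Error (\w+) \((\d+)\): (\w+)\s*$)");
  std::smatch m;
  if (!std::regex_match(line, m, pattern))
    return std::nullopt;
  return ErrorField{m[1].str(), std::stoi(m[2].str()), m[3].str()};
}
```

The numeric value can then be compared against the matching enumeration in xrt_error_code.h.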

You can use xrt-smi to report errors. The error report accumulates all the errors from the various classes and sorts them by timestamp. The report queries the drivers to determine the time of the last reset request.

$ xrt-smi examine -r error -d 0               

Asynchronous Errors
Time                          Class      Module           Driver      Severity           Error Code
Fri Nov 19 17:19:42 2021 GMT  CLASS_AIE  MODULE_AIE_CORE  DRIVER_AIE  SEVERITY_CRITICAL  AIE_ACCESS


$ xrt-smi examine -r error -f json -o <OUTPUT_FILE> -d 0
{
  "schema_version": {
    "schema": "JSON",
    "creation_date": "Fri Nov 19 17:58:09 2021 GMT"
  },
  "devices": [
    {
      "interface_type": "pcie",
      "device_id": "0000:00:00.0",
      "asynchronous_errors": [
        {
          "time": {
            "epoch": "1637342382770339700",
            "timestamp": "Fri Nov 19 17:19:42 2021 GMT"
          },
          "class": "CLASS_AIE",
          "module": "MODULE_AIE_CORE",
          "severity": "SEVERITY_CRITICAL",
          "driver": "DRIVER_AIE",
          "error_code": {
            "error_id": "6",
            "error_msg": "AIE_ACCESS"
          }
        }
      ]
    }
  ]
}
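
The JSON report pairs each error's epoch value with a formatted timestamp. The sample values are consistent with the epoch being nanoseconds since the Unix epoch, so a raw timestamp can be converted to the same readable form. The following is a minimal sketch under that assumption; `format_epoch_ns` is a hypothetical helper name, and the output assumes the default C locale.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper: formats a timestamp given in nanoseconds since the
// Unix epoch as a UTC string in the style used by the error report.
std::string format_epoch_ns(std::uint64_t epoch_ns)
{
  // Drop the sub-second part; the report prints whole seconds only.
  const std::time_t seconds =
      static_cast<std::time_t>(epoch_ns / 1000000000ULL);
  const std::tm* utc = std::gmtime(&seconds);  // not thread-safe
  if (utc == nullptr)
    return {};
  char buf[64];
  std::strftime(buf, sizeof(buf), "%a %b %d %H:%M:%S %Y GMT", utc);
  return buf;
}
```

With the sample epoch value 1637342382770339700, this returns Fri Nov 19 17:19:42 2021 GMT, which matches the timestamp field in the report above.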

You can also use xrt-smi to report AI Engine running status and read registers for debug purposes. For example, the following command reads the status of kernels after the graph has executed.

$ xrt-smi examine -r aie -d 0

--------------------------
1/1 [0000:00:00.0] : edge
--------------------------
Aie
  Aie_Metadata
  GRAPH[ 0] Name : gr
          Status : unknown
    SNo. Core [C:R] Iteration_Memory [C:R] Iteration_Memory_Addresses 
    [ 0] 23:1 23:1 16388 
    [ 1] 23:2 23:0 6980 
    [ 2] 23:3 23:1 4 
    [ 3] 24:1 24:0 4 
    [ 4] 24:2 24:2 4 
    [ 5] 24:3 24:1 4 
    [ 6] 25:1 25:1 4 


Core [ 0]
  Column : 23
  Row : 1
  Core:
    Status : disabled, core_done
    Program Counter : 0x00000308
    Link Register : 0x00000290
    Stack Pointer : 0x000340a0
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

  Locks:
    0 : released_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write


  Events:
    core : 1, 2, 5, 22, 23, 24, 28, 29, 31, 32, 35, 36, 38, 39, 40, 44, 45, 47, 68
    memory : 1, 43, 44, 45, 106, 113

......


Core [ 6]
  Column : 25
  Row : 1
  Core:
    Status : enabled, east_lock_stall
    Program Counter : 0x000001e6
    Link Register : 0x000000b0
    Stack Pointer : 0x00030020
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : stalled_on_requesting_lock
        Queue Size : 0
        Queue Status : okay
        Current BD : 2

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : running
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0


  Locks:
    0 : acquired_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write

  Events:
    core : 1, 2, 5, 22, 26, 28, 29, 31, 32, 35, 38, 39, 44
    memory : 1, 20, 21, 23, 35, 43, 44, 106, 113

Use the following command to read specific registers for debug purposes.

$ xrt-smi advanced --read-aie-reg -d 0 0 25 Core_Status 
Register Core_Status Value of Row:0 Column:25 is 0x00000201

For AI Engine register definitions, see the Versal Adaptive SoC AI Engine Register Reference (AM015).
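
Before looking up bit fields in the register reference, it can help to list which bits are set in the value that xrt-smi returns. The following is a minimal sketch; `set_bits` is a hypothetical helper name, and the meaning of each bit position must still be taken from the register reference.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the positions of the bits that are set in
// a 32-bit register value, lowest bit first.
std::vector<int> set_bits(std::uint32_t value)
{
  std::vector<int> bits;
  for (int bit = 0; bit < 32; ++bit) {
    if ((value >> bit) & 1u)
      bits.push_back(bit);
  }
  return bits;
}
```

For the Core_Status value 0x00000201 shown above, this returns bit positions 0 and 9.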

For details on using the xrt-smi command, see Xilinx Runtime (XRT) Architecture.

For error analysis in the Vitis IDE, see Analyzing AI Engine Status.