XRT provides error reporting APIs and tools. There are two types of errors:
- Synchronous error: an error that is detectable during the XRT runtime function call.
- Asynchronous error: an error from the underlying driver, system, hardware, etc.
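A synchronous error is reported as an exception thrown by the XRT API call, which the application can catch:
auto ghdl = xrt::graph(device, uuid, "gr");
try {
    ghdl.update("gr.fir24.in[1]", narrow_filter);
    ghdl.run(16);
    ghdl.read("gr.fir24.inout[0]", coeffs_readback); // Async read
} catch (std::exception const& e) {
    // Report the reason carried by the exception
    std::cout << "Graph Execution Error: " << e.what() << std::endl;
    return 1;
}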
An asynchronous error can occur independently of the current XRT function call or the running application. Asynchronous errors are cached in driver subsystems and can be accessed by the user application through the asynchronous error reporting APIs. Cached errors are persistent until explicitly cleared.
Persistent errors are not necessarily indicative of the current system state. For example, a board that has been reset can function correctly while previously cached errors are still available. To avoid confusion about the current state, each asynchronous error carries a timestamp indicating when the error occurred. This timestamp can be compared with, for example, the timestamp of the last xrt-smi reset.
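The timestamp comparison can be sketched as follows. This is a minimal illustration rather than XRT code: is_current_error is a hypothetical helper, and both timestamps are assumed to be unsigned 64-bit counts in the same unit and relative to the same epoch. How the application obtains the time of the last reset is outside the scope of the sketch.

```cpp
#include <cstdint>

// Hypothetical helper (not part of XRT): returns true when a cached
// asynchronous error was recorded at or after the last device reset, and so
// may still describe the current system state. Both arguments are ASSUMED to
// use the same unit and epoch.
bool is_current_error(std::uint64_t error_timestamp,
                      std::uint64_t last_reset_timestamp)
{
    return error_timestamp >= last_reset_timestamp;
}
```

An error for which this check returns false predates the reset and can be treated as historical.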
The errors cached by the driver contain a system error code and additional metadata, as defined in the xrt_error_code.h file in the XRT Repository. The user space and the kernel space share this information.
For the XRT error handling APIs, see experimental/xrt_error.h. The following is an asynchronous error handling example:
xrt::error error(device, XRT_ERROR_CLASS_AIE);
auto errCode = error.get_error_code();
auto timestamp = error.get_timestamp();
auto err_str = error.to_string();
/* code to deal with this specific error */
std::cout<<"Async error: "<< err_str << std::endl;
The following is an example of asynchronous error output:
Error Number (6): AIE_ACCESS
Error Driver (4): DRIVER_AIE
Error Severity (3): SEVERITY_CRITICAL
Error Module (3): MODULE_AIE_CORE
Error Class (2): CLASS_AIE
Timestamp: 1637342412366664740
XRT maintains the latest error for each class and an associated timestamp.
You can use the xrt_error_code.h file in the XRT Repository to
interpret error information. For example, Error Module (3):
MODULE_AIE_CORE corresponds to XRT_ERROR_MODULE_AIE_CORE in the xrtErrorModule enumeration.
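As a rough sketch of how the numeric fields in the output above relate to a single 64-bit error code, the following splits a code into its number, driver, severity, module, and class fields. The mask and shift constants are local to this example and are assumed to mirror the macros in xrt_error_code.h; confirm them against the header in your XRT release before relying on them. DecodedError and decode_error_code are hypothetical names.

```cpp
#include <cstdint>

// Field positions ASSUMED to mirror the mask/shift macros in
// xrt_error_code.h; verify against the header shipped with your XRT release.
constexpr std::uint64_t kNumberMask = 0xFFFF;
constexpr std::uint64_t kFieldMask = 0xF;
constexpr unsigned kDriverShift = 16;
constexpr unsigned kSeverityShift = 24;
constexpr unsigned kModuleShift = 32;
constexpr unsigned kClassShift = 40;

// Hypothetical container for the decoded fields (not an XRT type).
struct DecodedError {
    unsigned error_number;
    unsigned driver;
    unsigned severity;
    unsigned module_id;
    unsigned error_class;
};

// Extract each field by shifting it down and masking off the rest.
DecodedError decode_error_code(std::uint64_t code)
{
    return DecodedError{
        static_cast<unsigned>(code & kNumberMask),
        static_cast<unsigned>((code >> kDriverShift) & kFieldMask),
        static_cast<unsigned>((code >> kSeverityShift) & kFieldMask),
        static_cast<unsigned>((code >> kModuleShift) & kFieldMask),
        static_cast<unsigned>((code >> kClassShift) & kFieldMask),
    };
}
```

Under these assumptions, a code of 0x0000020303040006 decodes to number 6, driver 4, severity 3, module 3, and class 2, which are the values shown in the example output. Each value can then be looked up in the corresponding enumeration in xrt_error_code.h.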
You can use xrt-smi to report errors. The
error report accumulates all the errors from the various
classes and sorts them by timestamp. The report queries the drivers to determine the time of
the last reset request.
$ xrt-smi examine -r error -d 0
Asynchronous Errors
  Time                          Class      Module           Driver      Severity           Error Code
  Fri Nov 19 17:19:42 2021 GMT  CLASS_AIE  MODULE_AIE_CORE  DRIVER_AIE  SEVERITY_CRITICAL  AIE_ACCESS
$ xrt-smi examine -r error -f json -o <OUTPUT_FILE> -d 0
{
    "schema_version": {
        "schema": "JSON",
        "creation_date": "Fri Nov 19 17:58:09 2021 GMT"
    },
    "devices": [
        {
            "interface_type": "pcie",
            "device_id": "0000:00:00.0",
            "asynchronous_errors": [
                {
                    "time": {
                        "epoch": "1637342382770339700",
                        "timestamp": "Fri Nov 19 17:19:42 2021 GMT"
                    },
                    "class": "CLASS_AIE",
                    "module": "MODULE_AIE_CORE",
                    "severity": "SEVERITY_CRITICAL",
                    "driver": "DRIVER_AIE",
                    "error_code": {
                        "error_id": "6",
                        "error_msg": "AIE_ACCESS"
                    }
                }
            ]
        }
    ]
}
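The JSON report carries each error time both as a raw "epoch" value and as a readable "timestamp". The sketch below shows one way an application might produce the readable form from the raw value. It assumes the raw value is a count of nanoseconds since the Unix epoch, which is consistent with the sample above but is not stated by the report itself; format_error_time is a hypothetical helper, not an XRT function.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not part of XRT): formats a timestamp, ASSUMED to be
// nanoseconds since the Unix epoch, as a UTC date string.
std::string format_error_time(std::uint64_t epoch_ns)
{
    // Drop the sub-second part; the readable form has one-second resolution.
    const std::time_t seconds =
        static_cast<std::time_t>(epoch_ns / 1000000000ULL);

    // std::gmtime returns a pointer to shared static storage, so this
    // sketch is not thread-safe.
    const std::tm* utc = std::gmtime(&seconds);
    if (utc == nullptr)
        return std::string();

    char buffer[64];
    const std::size_t length = std::strftime(
        buffer, sizeof buffer, "%a %b %d %H:%M:%S %Y GMT", utc);
    return std::string(buffer, length);
}
```

With the "epoch" value from the sample, format_error_time(1637342382770339700ULL) yields "Fri Nov 19 17:19:42 2021 GMT", matching the "timestamp" field.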
You can also use xrt-smi to report
AI Engine running status and read registers for debug
purposes. For example, the following command reads the status of kernels after the graph has
executed.
$ xrt-smi examine -r aie -d 0
--------------------------
1/1 [0000:00:00.0] : edge
--------------------------
Aie
Aie_Metadata
GRAPH[ 0] Name : gr
Status : unknown
SNo. Core [C:R] Iteration_Memory [C:R] Iteration_Memory_Addresses
[ 0] 23:1 23:1 16388
[ 1] 23:2 23:0 6980
[ 2] 23:3 23:1 4
[ 3] 24:1 24:0 4
[ 4] 24:2 24:2 4
[ 5] 24:3 24:1 4
[ 6] 25:1 25:1 4
Core [ 0]
Column : 23
Row : 1
Core:
Status : disabled, core_done
Program Counter : 0x00000308
Link Register : 0x00000290
Stack Pointer : 0x000340a0
DMA:
MM2S:
Channel:
Id : 0
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
Id : 1
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
S2MM:
Channel:
Id : 0
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
Id : 1
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
Locks:
0 : released_for_write
1 : released_for_write
2 : released_for_write
3 : released_for_write
4 : released_for_write
5 : released_for_write
6 : released_for_write
7 : released_for_write
8 : released_for_write
9 : released_for_write
10 : released_for_write
11 : released_for_write
12 : released_for_write
13 : released_for_write
14 : released_for_write
15 : released_for_write
Events:
core : 1, 2, 5, 22, 23, 24, 28, 29, 31, 32, 35, 36, 38, 39, 40, 44, 45, 47, 68
memory : 1, 43, 44, 45, 106, 113
......
Core [ 6]
Column : 25
Row : 1
Core:
Status : enabled, east_lock_stall
Program Counter : 0x000001e6
Link Register : 0x000000b0
Stack Pointer : 0x00030020
DMA:
MM2S:
Channel:
Id : 0
Channel Status : stalled_on_requesting_lock
Queue Size : 0
Queue Status : okay
Current BD : 2
Id : 1
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
S2MM:
Channel:
Id : 0
Channel Status : running
Queue Size : 0
Queue Status : okay
Current BD : 0
Id : 1
Channel Status : idle
Queue Size : 0
Queue Status : okay
Current BD : 0
Locks:
0 : acquired_for_write
1 : released_for_write
2 : released_for_write
3 : released_for_write
4 : released_for_write
5 : released_for_write
6 : released_for_write
7 : released_for_write
8 : released_for_write
9 : released_for_write
10 : released_for_write
11 : released_for_write
12 : released_for_write
13 : released_for_write
14 : released_for_write
15 : released_for_write
Events:
core : 1, 2, 5, 22, 26, 28, 29, 31, 32, 35, 38, 39, 44
memory : 1, 20, 21, 23, 35, 43, 44, 106, 113
Use the following command to read specific registers for debug purposes.
$ xrt-smi advanced --read-aie-reg -d 0 0 25 Core_Status
Register Core_Status Value of Row:0 Column:25 is 0x00000201
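A raw register value such as 0x00000201 is easier to read once its set bits are named. The following sketch decodes a Core_Status value using a deliberately partial table. The three bit positions listed are assumptions for illustration, labeled with the status strings that appear in the xrt-smi report; check every position against AM015 before use. decode_core_status and kKnownBits are hypothetical names.

```cpp
#include <cstdint>
#include <string>

// Partial, ASSUMED mapping of Core_Status bit positions to the status names
// printed by xrt-smi. Verify each entry against AM015.
struct StatusBit {
    unsigned bit;
    const char* name;
};

constexpr StatusBit kKnownBits[] = {
    {0, "enabled"},
    {9, "east_lock_stall"},
    {20, "core_done"},
};

// Returns a comma-separated list of the set bits. Bits missing from the
// table above are reported as "bit_<position>".
std::string decode_core_status(std::uint32_t value)
{
    std::string result;
    for (unsigned bit = 0; bit < 32; ++bit) {
        if ((value & (1u << bit)) == 0)
            continue;
        std::string name = "bit_" + std::to_string(bit);
        for (const StatusBit& known : kKnownBits) {
            if (known.bit == bit)
                name = known.name;
        }
        if (!result.empty())
            result += ", ";
        result += name;
    }
    return result;
}
```

Under these assumptions, decode_core_status(0x00000201) returns "enabled, east_lock_stall", the same wording that the status report uses for a stalled core.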
For AI Engine register definitions, see the Versal Adaptive SoC AI Engine Register Reference (AM015).
For details on xrt-smi command use, see
Xilinx Runtime (XRT) Architecture.
For error analysis in the Vitis IDE, see Analyzing AI Engine Status.