AI Engine Runtime Error Handling - 2024.1 English - UG1642

AI Engine System Software Driver Reference Manual (UG1642)

Document ID: UG1642
Release Date: 2024-05-30
Version: 2024.1 English

The AI Engine events that are treated as errors and generate interrupts are configured at compile time, during CDO generation. At runtime, the software stack only reports errors; it does not change the error interrupt configuration. In the XRT flow, the AI Engine kernel driver notifies the XRT kernel driver of errors by calling the callback that XRT registered. The XRT kernel driver handles all the errors. The XRT software stack needs to provide APIs that let high-level libraries or applications query graph errors. Errors can occur at any time after the AI Engine is loaded. If no application has requested the partition when errors occur, the errors are not cleared. The AI Engine kernel driver notifies XRT later, when XRT registers its error callback.
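The deferred notification described above can be pictured as latching errors until a callback is available. The following is a minimal sketch of that pattern only, not the AI Engine or XRT kernel driver code: every name in it (`demo_partition`, `demo_report_error`, `demo_register_cb`, `demo_record_cb`) is hypothetical, and clearing the latched bits after delivery is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: receives a bitmask of latched error IDs. */
typedef void (*demo_err_cb)(uint32_t pending, void *priv);

/* Hypothetical per-partition state, for illustration only. */
struct demo_partition {
    uint32_t pending;   /* errors latched but not yet delivered */
    demo_err_cb cb;     /* NULL until a consumer registers */
    void *priv;         /* opaque pointer handed back to the callback */
};

/* Deliver the backlog if a callback is registered and errors are latched. */
static void demo_deliver(struct demo_partition *p)
{
    if (p->cb != NULL && p->pending != 0) {
        uint32_t snapshot = p->pending;
        p->pending = 0;             /* sketch choice: clear once delivered */
        p->cb(snapshot, p->priv);
    }
}

/* Latch an error; it stays latched while nobody is registered. */
void demo_report_error(struct demo_partition *p, unsigned id)
{
    p->pending |= (uint32_t)1u << (id & 31u);
    demo_deliver(p);
}

/* Register a callback; errors latched earlier are delivered right away. */
void demo_register_cb(struct demo_partition *p, demo_err_cb cb, void *priv)
{
    p->cb = cb;
    p->priv = priv;
    demo_deliver(p);
}

/* Example callback: accumulates delivered bits into a uint32_t. */
void demo_record_cb(uint32_t pending, void *priv)
{
    *(uint32_t *)priv |= pending;
}
```

In this sketch, errors reported before `demo_register_cb` is called remain in `pending` and reach the callback at registration time, which mirrors the late-registration behavior described for XRT.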

When there is no XRT in the flow, userspace libraries handle errors. The AI Engine file descriptor is used to poll for errors. The AI Engine embeddedsw driver enables error notification by polling the AI Engine file descriptor, and it provides a wrapper API for you to poll. AMD removed error signaling to applications and application-side error callback registration, so there is no need to spawn a thread to monitor errors. The AI Engine embeddedsw driver also provides APIs for you to get the details of the error groups that occurred and of the individual errors. When errors occur, the application is expected to reset and restart to recover from them.
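Polling a file descriptor from the application's own thread might look like the sketch below. It uses only the standard POSIX `poll()` call and does not show the embeddedsw wrapper API. The helper name `demo_wait_for_error` is hypothetical, and the assumption that a pending error appears as `POLLIN` on the AI Engine file descriptor is made for illustration; check the driver documentation for the actual event mask.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to report an event.
 * Returns 1 if an event is pending, 0 on timeout, -1 on failure.
 * Assumption: a pending error is signalled as POLLIN on the descriptor.
 */
int demo_wait_for_error(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    /* Retry if a signal interrupts the wait. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;      /* poll() itself failed */
    if (rc == 0)
        return 0;       /* timed out, nothing pending */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

An application could call such a helper with a zero timeout at convenient points in its main loop, which is why no dedicated monitoring thread is needed. After a positive result, it would query the error details through the driver APIs and then reset and restart.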

Figure 1. Bare-metal AI Engine Runtime Errors Handling