XRT Error Handling APIs - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID
XD100
Release Date
2024-03-05
Version
2023.2 English

XRT provides the xrt::error class and its member functions to retrieve asynchronous errors in user-space host code. This section walks through a methodology for handling errors that originate in the underlying driver, system, hardware, and so on.

To illustrate the XRT error handling APIs, an out-of-bounds access is introduced in the kernel code, which in turn causes a failure when the AI Engine graph is executed under control of the host code.

  1. Add a memory read violation to the kernel code by opening cmd_src/aie/kernels/peak_detect.cc and changing line 26 to v_in = *(InIter+8500000500).

  2. Replace the cmd_src/sw/host.cpp file with Hardware/host_xrtErrorAPI.cpp. Make sure to back up the original file first.

  3. Observe lines 87-93.

    • xrt::error -> Class to retrieve the asynchronous errors in the host code.

    • get_timestamp() -> Member function to get the timestamp of the last error.

    • to_string() -> Member function to get the description string of a given error code.
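The member functions listed above can be combined into a small reporting helper. The following is a minimal sketch, not the tutorial's host code: describe_last_error and fake_error are hypothetical names, and fake_error is only a stand-in that mimics the xrt::error member functions so the example compiles without XRT installed. Treating an error code of zero as "no error recorded" is also an assumption.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in mimicking the xrt::error member functions,
// so this sketch builds without the XRT headers.
struct fake_error {
    std::uint64_t code;
    std::uint64_t timestamp;
    std::string text;
    std::uint64_t get_error_code() const { return code; }
    std::uint64_t get_timestamp() const { return timestamp; }
    std::string to_string() const { return text; }
};

// Hypothetical helper: formats the last asynchronous error held by `err`.
// Returns an empty string when the error code is zero
// (assumption: zero means no error has been recorded).
template <typename ErrorT>
std::string describe_last_error(const ErrorT& err) {
    if (err.get_error_code() == 0) {
        return std::string();
    }
    std::ostringstream out;
    out << "last error at " << err.get_timestamp() << ":\n" << err.to_string();
    return out.str();
}
```

In the host application, the same helper would presumably be called with a real xrt::error object constructed from the device and the XRT_ERROR_CLASS_AIE error class, instead of the stand-in.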

  4. Run make all TARGET=hw to build the AI Engine kernels, the s2mm and mm2s kernels, and the host application, and to run the link and package steps that generate the SD card image.

  5. Repeat steps 3 and 4 from Running the Design on Hardware to run the design on hardware.

  6. Observe the output from the Linux console.

    aie aie0: Asserted tile error event 60 at col 25 row 1
    
    • The message above is the error propagated from the AI Engine array, and it is used to debug application-specific errors. For the list of error events, refer to the topic AI Engine Error Events. Error event 60 represents a DM address out of range, and here it occurs in the tile at col 25 row 1.

    • You can open the graph compile summary in Vitis Analyzer and identify the kernel mapped to that tile, which is peak_detect in this case.

    • You can debug this out-of-bounds access at the AI Engine simulation level. Refer to Debugging memory access violations for more information.

    The other message in the console is the asynchronous error output.
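When scanning a long console log, it can help to extract the event number and tile coordinates from the driver message automatically. The following is a minimal sketch that assumes the message always has the form shown above. TileError and parse_tile_error are hypothetical names and are not part of XRT or the AI Engine driver.

```cpp
#include <cstdio>
#include <string>

// Event number and tile location taken from one driver message.
struct TileError {
    int event;
    int col;
    int row;
};

// Hypothetical helper: parses a line such as
//   "aie aie0: Asserted tile error event 60 at col 25 row 1"
// Returns true and fills `out` on success, false if the line does not match.
bool parse_tile_error(const std::string& line, TileError& out) {
    static const std::string marker = "Asserted tile error event";
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos) {
        return false;
    }
    TileError parsed{};
    const char* rest = line.c_str() + pos + marker.size();
    // All three integers must be read for the line to count as a match.
    if (std::sscanf(rest, " %d at col %d row %d",
                    &parsed.event, &parsed.col, &parsed.row) != 3) {
        return false;
    }
    out = parsed;
    return true;
}
```

The extracted column and row can then be looked up in the graph compile summary to find the kernel placed on that tile.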

    Error Driver (4): DRIVER_AIE
    Error Severity (3): SEVERITY_CRITICAL
    Error Module (3): MODULE_AIE_CORE
    Error Class (2): CLASS_AIE
    Timestamp: 1667916688683323200
    
    • XRT maintains the latest error for each class, along with a timestamp for when the error was generated. The error fields can be interpreted using xrt_error_code.h.

    For example, Error Module (3): MODULE_AIE_CORE corresponds to XRT_ERROR_MODULE_AIE_CORE in enumeration xrtErrorModule.
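The numeric values in parentheses are fields packed into a single 64-bit error code. The following is a minimal sketch of unpacking them, assuming the field layout defined by the shift and mask macros in xrt_error_code.h (4-bit fields at bits 16, 24, 32, and 40). The constants are copied here as assumptions so the example is self-contained; verify them against the header shipped with your XRT release. ErrorFields and decode_error_code are hypothetical names.

```cpp
#include <cstdint>

// Assumed layout, mirroring the macros in xrt_error_code.h.
constexpr std::uint64_t kFieldMask = 0xF;
constexpr unsigned kDriverShift = 16;
constexpr unsigned kSeverityShift = 24;
constexpr unsigned kModuleShift = 32;
constexpr unsigned kClassShift = 40;

// The four fields printed in the console output.
struct ErrorFields {
    unsigned driver;
    unsigned severity;
    unsigned module_id;
    unsigned error_class;
};

// Extracts one 4-bit field from the packed error code.
inline unsigned extract_field(std::uint64_t code, unsigned shift) {
    return static_cast<unsigned>((code >> shift) & kFieldMask);
}

// Hypothetical helper: splits a packed error code into its fields.
inline ErrorFields decode_error_code(std::uint64_t code) {
    ErrorFields f{};
    f.driver = extract_field(code, kDriverShift);
    f.severity = extract_field(code, kSeverityShift);
    f.module_id = extract_field(code, kModuleShift);
    f.error_class = extract_field(code, kClassShift);
    return f;
}
```

For the console output above, decoding would yield driver 4, severity 3, module 3, and class 2, which are then matched against the corresponding enumerations in xrt_error_code.h.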

  7. Press Ctrl+Z to suspend the execution.