Recovering from a Hang System - 2023.2 English

Zynq UltraScale+ MPSoC Software Developer Guide (UG1137)

Document ID
UG1137
Release Date
2023-11-28
Version
2023.2 English

In the event of a system hang, as indicated by an FPD WDT timeout, the PMU can be used to carry out a sequence of events to try to recover from the unresponsive condition. By default, when the FPD WDT times out, the PMU firmware does not invoke any type of restart, so that the user can specify the exact desired behavior. However, AMD provides a typical recovery scheme in which the PMU firmware monitors the state of the APU subsystem using the FPD WDT and restarts the APU (Linux) subsystem if the timer expires, indicating a problem with Linux.

Because the RPU subsystem is managed by Linux using remoteproc, the life cycle of the RPU subsystem is entirely up to Linux. The PMU is not involved in deciding when to restart the RPU subsystem(s). RPU hang recovery can also be implemented with the help of either a software or a hardware watchdog between the APU and RPU subsystems. In that case, the watchdog is configured and handled by Linux, but the heartbeats are provided by the RPU application(s). The exact method of deciding when to restart the RPU is up to the user; a watchdog is simply one of many possibilities.

To enable recovery, the PMU firmware should be built with error management and recovery enabled. The following macros enable the recovery feature:

  • ENABLE_EM
  • ENABLE_RECOVERY

It is also necessary to build TF-A with the following flag (see APU Idling for details):

ZYNQMP_WARM_RESTART=1

Important: One TTC timer (timer 9) is reserved for the PMU's use when these compile flags are enabled.
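
As an illustration of the software watchdog option mentioned earlier for RPU hang recovery, the following is a minimal sketch of a Linux user-space monitor. It is not part of the AMD recovery scheme. It assumes that the RPU application increments a heartbeat counter that Linux can read through some user-chosen channel (the read_heartbeat callable is hypothetical), and that the RPU is exposed through the standard remoteproc sysfs state file. The remoteproc0 index, the timeout, and the polling interval are also assumptions that vary by system.

```python
import time
from typing import Callable, Optional

# Assumption: the remoteproc instance index differs between systems.
REMOTEPROC_STATE = "/sys/class/remoteproc/remoteproc0/state"


class HeartbeatWatchdog:
    """Software watchdog that reports a hang when the heartbeat counter stalls."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_count: Optional[int] = None
        self._last_change = clock()

    def poll(self, count: int) -> bool:
        """Feed the latest counter value; return True if the RPU looks hung."""
        now = self._clock()
        if count != self._last_count:
            # The counter moved, so the RPU application is still alive.
            self._last_count = count
            self._last_change = now
            return False
        return (now - self._last_change) >= self.timeout_s

    def rearm(self) -> None:
        """Forget the previous counter value, for example after a restart."""
        self._last_count = None
        self._last_change = self._clock()


def restart_rpu(state_path: str = REMOTEPROC_STATE) -> None:
    """Stop and then start the remote processor through remoteproc sysfs."""
    for command in ("stop", "start"):
        with open(state_path, "w") as state:
            state.write(command)


def monitor(read_heartbeat: Callable[[], int],
            watchdog: HeartbeatWatchdog,
            restart: Callable[[], None] = restart_rpu,
            poll_interval_s: float = 0.5) -> None:
    """Poll forever and restart the RPU whenever the heartbeat stalls."""
    while True:
        if watchdog.poll(read_heartbeat()):
            restart()
            watchdog.rearm()
        time.sleep(poll_interval_s)
```

A caller would supply its own heartbeat reader, for example monitor(read_heartbeat=my_reader, watchdog=HeartbeatWatchdog(timeout_s=2.0)), where my_reader is a placeholder for whatever mechanism the RPU application uses to publish its counter. Injecting the clock and the restart action keeps the hang decision separate from the platform-specific parts, so the policy can be exercised without hardware.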