Escalation - 2023.2 English

Zynq UltraScale+ MPSoC Software Developer Guide (UG1137)

Document ID
UG1137
Release Date
2023-11-28
Version
2023.2 English

If the current recovery attempt cannot bring the system back to a working state, the system must escalate to a more severe type of reset on the next WDT expiry to attempt a full recovery. You decide on the escalation scheme. A commonly used scheme starts with APU restart on the first watchdog expiration, followed by PS-only reset on the next watchdog expiration, and finally system reset.
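The commonly used scheme described above can be sketched as a simple mapping from the number of consecutive unrecovered watchdog expiries to a reset level. This is a minimal illustration only: the type `reset_level_t` and the function `escalation_level` are hypothetical names and are not part of the PMU firmware sources.

```c
#include <stdint.h>

/* Hypothetical reset levels, ordered from least to most severe. */
typedef enum {
    RESET_APU_RESTART = 0,  /* first expiry: restart the APU subsystem */
    RESET_PS_ONLY     = 1,  /* second expiry: reset the PS only */
    RESET_SYSTEM      = 2   /* third and later expiries: full system reset */
} reset_level_t;

/*
 * Map the count of consecutive unrecovered WDT expiries (1-based)
 * to the reset level used in the commonly used scheme.
 */
static reset_level_t escalation_level(uint32_t expiry_count)
{
    if (expiry_count <= 1U) {
        return RESET_APU_RESTART;
    }
    if (expiry_count == 2U) {
        return RESET_PS_ONLY;
    }
    return RESET_SYSTEM;
}
```

A real implementation would also need to keep the expiry count in storage that survives the less severe resets; where that count lives is left open here.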

To enable escalation, PMU firmware must be built with the following flag:

ENABLE_ESCALATION
Escalation Scheme

Default Scheme

The default escalation scheme checks for a successful pm_system_shutdown call from TF-A for an APU-only restart, which happens when TF-A is able to idle all active CPUs. If TF-A cannot idle the active cores, the WDT times out again with the WDT_In_Progress flag still set, which triggers escalation.

Escalation triggers a system-level reset. A system-level reset is defined as a PS-only reset if the PL is present, or a system restart if the PL is not present.
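The decision described for the default scheme can be modeled in a few lines. The following is a simplified sketch, not the PMU firmware code: `recovery_state_t`, `wdt_action_t`, `on_wdt_expiry`, and `on_system_shutdown_call` are hypothetical names, and the state is held in an ordinary struct rather than in firmware variables.

```c
#include <stdbool.h>

/* Hypothetical actions the firmware could take on a WDT expiry. */
typedef enum {
    ACTION_APU_RESTART_REQUEST, /* ask TF-A to idle the cores for APU restart */
    ACTION_PS_ONLY_RESET,       /* escalation when the PL is present */
    ACTION_SYSTEM_RESTART       /* escalation when the PL is not present */
} wdt_action_t;

/* Hypothetical model of the state the default scheme depends on. */
typedef struct {
    bool wdt_in_progress; /* models the WDT_In_Progress flag */
    bool pl_present;      /* true if the PL is present */
} recovery_state_t;

/* Decide what to do when the WDT expires. */
static wdt_action_t on_wdt_expiry(recovery_state_t *st)
{
    if (!st->wdt_in_progress) {
        /* First expiry: mark recovery as in progress and request APU restart. */
        st->wdt_in_progress = true;
        return ACTION_APU_RESTART_REQUEST;
    }
    /* Flag still set: TF-A never completed pm_system_shutdown, so escalate. */
    return st->pl_present ? ACTION_PS_ONLY_RESET : ACTION_SYSTEM_RESTART;
}

/* Models a successful pm_system_shutdown call from TF-A. */
static void on_system_shutdown_call(recovery_state_t *st)
{
    st->wdt_in_progress = false;
}
```

In this model, two expiries in a row without an intervening `on_system_shutdown_call` lead to escalation, which mirrors the case where TF-A fails to idle the active cores.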

The following figure shows the flow of the control in case of default escalation scheme.

Figure 1. Flow of Control for Default Escalation Scheme

Healthy Bit Scheme

The default escalation scheme does not guarantee a successful reboot of the system. It only guarantees that TF-A successfully idled the CPUs during recovery. Consider the scenario in which FPD_WDT has timed out and APU subsystem restart is requested, and TF-A successfully makes the pm_system_shutdown call. The APU subsystem restart is far from finished at that point. The restart process can get stuck elsewhere, such as in FSBL, U-Boot, or the Linux init stage. If that happens, FPD_WDT expires again, and the same cycle repeats as long as TF-A is loaded and functioning. This cycle can continue indefinitely without the system booting back into a clean running state.

The Healthy Bit scheme solves this problem. In addition to the default scheme checks, the PMU firmware checks a Healthy Bit, which Linux sets after booting successfully. On WDT expiry, if the Healthy Bit is set, Linux was able to boot into a clean running state and no escalation is needed. If the Healthy Bit is not set, the last restart attempt did not successfully boot into Linux and escalation is needed. There is no point in repeating the same type of restart, so the PMU firmware escalates and calls a system-level reset.

The Healthy Bit scheme is implemented using bit 0 of a PMU global general storage register (PMU_GLOBAL_GLOBAL_GEN_STORAGE4[0]). The PMU firmware clears the bit before starting a recovery or a normal reboot, and Linux should set the bit to flag a healthy boot.
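The clear, set, and check sequence can be illustrated with a small model. This is a sketch under stated assumptions: the variable `gen_storage4` is an ordinary variable standing in for PMU_GLOBAL_GLOBAL_GEN_STORAGE4, and `HEALTHY_BIT_MASK`, `clear_healthy_bit`, `set_healthy_bit`, and `escalation_needed` are hypothetical names, not the firmware's own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the storage register, as described in the text. */
#define HEALTHY_BIT_MASK 0x1U

/* Stand-in for PMU_GLOBAL_GLOBAL_GEN_STORAGE4 (a plain variable here). */
static uint32_t gen_storage4;

/* PMU firmware side: clear the bit before recovery or a normal reboot. */
static void clear_healthy_bit(void)
{
    gen_storage4 &= ~HEALTHY_BIT_MASK;
}

/* Linux side: set the bit once the boot has completed successfully. */
static void set_healthy_bit(void)
{
    gen_storage4 |= HEALTHY_BIT_MASK;
}

/* PMU firmware side, on WDT expiry: escalate only if the bit is not set. */
static bool escalation_needed(void)
{
    return (gen_storage4 & HEALTHY_BIT_MASK) == 0U;
}
```

Only bit 0 is touched, so the remaining bits of the register stay available for other uses in this model.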

PMU global registers are accessed from Linux through the sysfs interface. To set the healthy bit from Linux, execute the following command (or perform the equivalent write in code):


# echo "0x20000000 0x20000000" > "/sys/devices/platform/firmware/ggs0"
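For applications that prefer to set the bit from C rather than from a shell, the same write can be sketched as follows. The helper name `write_ggs` is hypothetical, and treating the two numbers as a mask followed by a value is an assumption inferred from the shape of the command above; check the sysfs ABI documentation of your kernel before relying on it.

```c
#include <stdio.h>

/*
 * Write "<mask> <value>" (both as 0x-prefixed hex) to a sysfs node.
 * The "mask value" interpretation is an assumption based on the echo
 * command in the text. Returns 0 on success, -1 on failure.
 */
static int write_ggs(const char *path, unsigned int mask, unsigned int value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }

    /* Same text the echo command produces, including the trailing newline. */
    int ok = fprintf(f, "0x%08x 0x%08x\n", mask, value) > 0;

    /* fclose flushes the buffer; a failure here means the write was lost. */
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

A call such as `write_ggs("/sys/devices/platform/firmware/ggs0", 0x20000000U, 0x20000000U)` would then mirror the command above. It typically needs root privileges.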

To enable the healthy bit based escalation scheme, build the PMU firmware with the following flag:

CHECK_HEALTHY_BOOT

The following figure shows the flow of the control in case of the healthy bit escalation scheme.

Figure 2. Healthy Bit Escalation Scheme

Customizing Recovery and Escalation Scheme

By default, when FPD_WDT times out, the PMU firmware does not invoke any type of restart. While AMD provides predefined RECOVERY and ESCALATION behaviors, you can customize them to implement a different scheme.

When FPD_WDT times out, the PMU firmware calls FpdSwdtHandler. If ENABLE_EM is defined, FpdSwdtHandler calls XPfw_RecoveryHandler. Otherwise, FpdSwdtHandler is an empty function.

In xpfw_mod_em.c:

#ifdef ENABLE_EM
void FpdSwdtHandler(u8 ErrorId)
{
	/* Log the watchdog error, then hand over to the recovery handler. */
	XPfw_Printf(DEBUG_ERROR, "EM: FPD Watchdog Timer Error (Error ID: %d)\r\n", ErrorId);
	XPfw_RecoveryHandler(ErrorId);
}
#else
void FpdSwdtHandler(u8 ErrorId) { }
#endif

Without ENABLE_EM, you can simply update FpdSwdtHandler, which is called on FPD_WDT timeout. With ENABLE_EM turned on, you need to update XPfw_RecoveryHandler.

Similarly, turning on RECOVERY defines XPfw_RecoveryHandler (see xpfw_restart.c). Unless RECOVERY is turned on, XPfw_RecoveryHandler is an empty function and nothing happens when FPD_WDT times out.

XPfw_RecoveryHandler follows the flow chart detailed in the Escalation Scheme section. When FPD_WDT times out, the code follows the progression of orange boxes: if WDT recovery is not already in progress, it restarts the WDT, sets the WDT_In_Progress flag, and raises the TTC (timer 9) interrupt to TF-A. TF-A then takes over (see blue boxes): it raises a software interrupt for all active cores, clears pending interrupts, and so on. Essentially, the PMU restarts and boosts the WDT, then sends a request to TF-A. TF-A cleanly idles all four APU cores, and when they have all reached WFI (Last Active Core is true), TF-A issues a PMU System Shutdown request back to the PMU with the APU subsystem as the argument. When the PMU receives this command, it invokes APU subsystem restart.

If ENABLE_ESCALATION is not set, the code never takes the Do Escalation path. If recovery hangs for some reason (for example, TF-A cannot put all four CPU cores into WFI), the system keeps retrying APU restart or hangs forever. When ENABLE_ESCALATION is on and anything goes wrong during execution of the flow chart, the WDT still appears to be in progress, because the WDT_In_Progress flag is cleared only as the last step. Do Escalation then calls SYSTEM_RESET instead of trying APU restart again and again.

To customize recovery and escalation behavior, use the provided XPfw_RecoveryHandler as a template for your own XPfw_RecoveryHandler function.