If the current recovery cannot bring the system back to a working state, the system must escalate to a more severe type of reset on the next WDT expiry in order to recover fully. It is up to you to decide on the escalation scheme. A commonly used scheme starts with an APU restart on the first watchdog expiration, followed by a PS-only reset on the next watchdog expiration, and finally a system reset.
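The commonly used scheme described above can be pictured as a ladder of reset levels. The following is a minimal sketch, not PMU firmware code; the names `ResetLevel`, `NextResetLevel`, and `LevelForExpiry` are hypothetical and exist only for illustration.

```c
/* Hypothetical reset levels, ordered from least to most severe. */
typedef enum {
    RESET_APU_RESTART = 0,  /* first WDT expiry */
    RESET_PS_ONLY     = 1,  /* second consecutive WDT expiry */
    RESET_SYSTEM      = 2   /* third and any later expiry */
} ResetLevel;

/* Return the level to try after 'current' failed to recover the system.
 * The ladder saturates at the most severe level. */
static ResetLevel NextResetLevel(ResetLevel current)
{
    return (current < RESET_SYSTEM) ? (ResetLevel)(current + 1) : RESET_SYSTEM;
}

/* Map the number of consecutive WDT expiries (1-based) to a reset level. */
static ResetLevel LevelForExpiry(unsigned int expiryCount)
{
    ResetLevel level = RESET_APU_RESTART;

    while (expiryCount > 1U) {
        level = NextResetLevel(level);
        expiryCount--;
    }
    return level;
}
```

In this sketch, `LevelForExpiry(1)` selects the APU restart, `LevelForExpiry(2)` the PS-only reset, and any higher count the system reset. A different scheme would only need a different ordering or a different number of levels.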
To enable escalation, build the PMU firmware with the following flag:
ENABLE_ESCALATION
Escalation Scheme
Default Scheme
The default escalation scheme checks for a successful pm_system_shutdown call from TF-A for an APU-only restart, which happens when TF-A is able to idle all active CPUs. If TF-A cannot idle the active cores, the WDT times out again with the WDT_In_Progress flag still set, which triggers escalation.
Escalation triggers a system-level reset. A system-level reset is defined as a PS-only reset if the PL is present, or a system restart if the PL is not present.
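The decision made by the default scheme can be sketched as a single function. This is an illustration under stated assumptions rather than the firmware implementation: the type `WdtAction`, the function `DefaultSchemeOnExpiry`, and its two boolean inputs are hypothetical stand-ins for the WDT_In_Progress flag and the PL presence check.

```c
#include <stdbool.h>

/* Hypothetical actions the PMU firmware can take on an FPD WDT expiry. */
typedef enum {
    ACTION_REQUEST_APU_RESTART, /* ask TF-A to idle the cores, then restart the APU */
    ACTION_PS_ONLY_RESET,       /* system-level reset when the PL is present */
    ACTION_SYSTEM_RESTART       /* system-level reset when the PL is not present */
} WdtAction;

/* Decide what to do when the WDT expires under the default scheme.
 * wdtInProgress: true if a previous recovery has not completed, meaning
 *                TF-A never issued a successful pm_system_shutdown.
 * plPresent:     true if the PL is present. */
static WdtAction DefaultSchemeOnExpiry(bool wdtInProgress, bool plPresent)
{
    if (!wdtInProgress) {
        /* First expiry: try the least severe recovery. */
        return ACTION_REQUEST_APU_RESTART;
    }
    /* Escalate: pick the system-level reset that matches the PL state. */
    return plPresent ? ACTION_PS_ONLY_RESET : ACTION_SYSTEM_RESTART;
}
```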
The following figure shows the flow of control for the default escalation scheme.
Healthy Bit Scheme
The default escalation scheme does not guarantee a successful reboot of the system. It only guarantees that TF-A successfully idles the CPUs during recovery. Consider the scenario in which the FPD_WDT has timed out and an APU subsystem restart is requested, and TF-A successfully makes the pm_system_shutdown call. The APU subsystem restart is far from finished once pm_system_shutdown is called. The restart process can still get stuck elsewhere, such as in the FSBL, U-Boot, or the Linux init stage. If that happens, FPD_WDT expires again and the same cycle repeats for as long as TF-A is loaded and functioning. This cycle can continue indefinitely without the system booting back into a clean running state.
The Healthy Bit scheme solves this problem. In addition to the default scheme, the PMU firmware checks for a Healthy Bit, which Linux sets after booting successfully. On WDT expiry, a set Healthy Bit indicates that Linux was able to boot into a clean running state, so no escalation is needed. If the Healthy Bit is not set, the last restart attempt did not boot into Linux successfully and escalation is needed. There is no point in repeating the same type of restart, so the PMU firmware escalates and calls a system-level reset.
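The combined check can be sketched as follows. This is a simplified model, not firmware code: the names `HealthyBitAction` and `HealthyBitOnExpiry` are hypothetical, and treating either condition (recovery still in progress, or Healthy Bit not set) as sufficient for escalation is an assumption based on the description above.

```c
#include <stdbool.h>

/* Hypothetical outcomes of a WDT expiry under the Healthy Bit scheme. */
typedef enum {
    HB_APU_RESTART,         /* no escalation needed */
    HB_SYSTEM_LEVEL_RESET   /* escalate */
} HealthyBitAction;

/* wdtInProgress: true if the previous recovery never completed
 *                (the default scheme's check).
 * healthyBitSet: true if Linux set the Healthy Bit after the last boot. */
static HealthyBitAction HealthyBitOnExpiry(bool wdtInProgress, bool healthyBitSet)
{
    if (wdtInProgress || !healthyBitSet) {
        /* Either TF-A could not idle the cores, or the last restart
         * never reached a clean Linux boot: do not retry the same restart. */
        return HB_SYSTEM_LEVEL_RESET;
    }
    return HB_APU_RESTART;
}
```

Because the PMU firmware clears the bit before each recovery, a restart that stalls in the boot chain leaves `healthyBitSet` false on the next expiry, which breaks the endless APU restart cycle.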
The Healthy Bit scheme is implemented using the 0th bit of the PMU global general storage register (PMU_GLOBAL_GLOBAL_GEN_STORAGE4[0]). The PMU firmware clears the bit before starting a recovery or a normal reboot, and Linux should set this bit to flag a healthy boot.
PMU global registers are accessed from Linux through the sysfs interface. To set the healthy bit from Linux, execute the following command (or perform the equivalent write in your code):
# echo "0x20000000 0x20000000" > "/sys/devices/platform/firmware/ggs0"
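The command above writes two hexadecimal words. Assuming they are a mask followed by a value (an interpretation this section does not confirm), the effect on the register can be modeled as a read-modify-write. The helper name `ApplyMaskedWrite` is hypothetical and is not part of any driver or firmware API.

```c
#include <stdint.h>

/* Model of a masked register write (assumed semantics):
 * only the bits selected by 'mask' take their new state from 'value',
 * all other bits of 'current' are preserved. */
static uint32_t ApplyMaskedWrite(uint32_t current, uint32_t mask, uint32_t value)
{
    return (current & ~mask) | (value & mask);
}
```

Under this assumption, passing the same word as both mask and value sets exactly the bits in that word and leaves the rest of the register untouched.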
To enable the healthy bit based escalation scheme, build the PMU firmware with the following flag:
CHECK_HEALTHY_BOOT
The following figure shows the flow of control for the healthy bit escalation scheme.
Customizing Recovery and Escalation Scheme
By default, when FPD_WDT times out, the PMU firmware does not invoke any type of restart. AMD provides predefined RECOVERY and ESCALATION behaviors, and you can customize them to implement a different scheme.
When FPD_WDT times out, it calls FpdSwdtHandler. If ENABLE_EM is defined, FpdSwdtHandler calls XPfw_RecoveryHandler. Otherwise, it is an empty function.
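In xpfw_mod_em.c:
#ifdef ENABLE_EM
/* Error management enabled: log the error and start recovery. */
void FpdSwdtHandler(u8 ErrorId)
{
    XPfw_Printf(DEBUG_ERROR, "EM: FPD Watchdog Timer Error (Error ID: %d)\r\n", ErrorId);
    XPfw_RecoveryHandler(ErrorId);
}
#else
/* Error management disabled: the handler does nothing. */
void FpdSwdtHandler(u8 ErrorId) { }
#endif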
Without ENABLE_EM, you can simply update FpdSwdtHandler, which is called on FPD_WDT timeout. With ENABLE_EM turned on, you need to update XPfw_RecoveryHandler.
Similarly, turning on RECOVERY defines XPfw_RecoveryHandler (see xpfw_restart.c). Unless RECOVERY is turned on, XPfw_RecoveryHandler is an empty function and nothing happens when FPD_WDT times out.
XPfw_RecoveryHandler follows the flow chart detailed in the Escalation Scheme section. When FPD_WDT times out, the code follows the progression of orange boxes. If WDT is not already in progress, the PMU restarts the WDT, sets the WDT_In_Progress flag, and raises the TTC (timer 9) interrupt to TF-A. TF-A then takes over: it raises a software interrupt for all active cores, clears pending interrupts, and so on (see blue boxes). Essentially, the PMU restarts and boosts the WDT, then sends a request to TF-A. TF-A cleanly idles all four APU cores, and when they have all reached WFI (Last Active Core is true), TF-A issues a PMU System Shutdown with the APU subsystem as the argument back to the PMU. When the PMU gets this command, it invokes the APU subsystem restart.
If ENABLE_ESCALATION is not set, the code never takes the Do Escalation path. If XPfw_RecoveryHandler hangs for some reason (for example, something went wrong and the four APU cores cannot all be put into WFI), it keeps retrying the APU restart or hangs forever. When ENABLE_ESCALATION is on and anything goes wrong during execution of the flow chart, the WDT still appears to be in progress (because the WDT_In_Progress flag is cleared only as the last step), so Do Escalation calls SYSTEM_RESET instead of trying the APU restart again and again.
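The role of the WDT_In_Progress flag can be illustrated with a small state model. This is a sketch under stated assumptions and does not reproduce xpfw_restart.c: the struct `RhState`, the functions `RhOnTimeout` and `RhOnRecoveryComplete`, and the runtime field `escalationEnabled` (standing in for the compile-time ENABLE_ESCALATION flag) are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical result of handling one FPD_WDT timeout. */
typedef enum {
    RH_APU_RESTART_REQUESTED, /* WDT restarted, flag set, request sent to TF-A */
    RH_SYSTEM_RESET           /* Do Escalation path */
} RhResult;

typedef struct {
    bool wdtInProgress;     /* set when recovery starts, cleared as the last step */
    bool escalationEnabled; /* models ENABLE_ESCALATION */
} RhState;

/* Called on each FPD_WDT timeout. */
static RhResult RhOnTimeout(RhState *s)
{
    if (s->wdtInProgress && s->escalationEnabled) {
        /* The previous recovery never finished: escalate. */
        return RH_SYSTEM_RESET;
    }
    /* Start (or, without escalation, retry) the APU restart sequence. */
    s->wdtInProgress = true;
    return RH_APU_RESTART_REQUESTED;
}

/* Called once the APU subsystem restart has been invoked successfully. */
static void RhOnRecoveryComplete(RhState *s)
{
    s->wdtInProgress = false;
}
```

In this model, two timeouts in a row without an intervening `RhOnRecoveryComplete` produce a system reset when escalation is enabled, and a repeated APU restart request when it is not.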
To customize recovery and escalation behavior, use the provided XPfw_RecoveryHandler as a template for your own XPfw_RecoveryHandler function.