If the current recovery attempt cannot bring the system back to a working state, the system must escalate to a more severe type of reset on the next WDT expiry in order to recover fully. The escalation scheme is up to you. A commonly used scheme starts with an APU restart on the first watchdog expiration, followed by a PS-only reset on the next watchdog expiration, and finally a system reset.
To enable escalation, the PMU firmware must be built with the following flag:
ENABLE_ESCALATION
Escalation Scheme
Default Scheme
The default escalation scheme checks for a successful pm_system_shutdown call from ATF for APU-only restart, which happens when ATF is able to successfully idle all active CPUs. If ATF does not succeed in idling the active cores, the WDT times out again with the WDT_In_Progress flag set, which results in escalation.
Escalation triggers a system-level reset. A system-level reset is defined as a PS-only reset if PL is present, or a system restart if PL is not present.
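The decision made by the default scheme can be sketched in C as follows. This is an illustrative model only, not the PMU firmware source: the names RecoveryAction and default_scheme_on_wdt_expiry are hypothetical, and pl_present stands in for however the firmware determines whether PL is present.

```c
#include <stdbool.h>

/* Hypothetical action codes; these are not PMU firmware identifiers. */
typedef enum {
    ACTION_APU_RESTART,   /* ask ATF to idle the cores, then restart the APU */
    ACTION_PS_ONLY_RESET, /* system-level reset when PL is present */
    ACTION_SYSTEM_RESET   /* system-level reset when PL is not present */
} RecoveryAction;

/*
 * Model of the default scheme on an FPD WDT expiry.
 * wdt_in_progress: true if an earlier recovery never completed, that is,
 *                  ATF did not make a successful pm_system_shutdown call.
 * pl_present:      true if PL is present (assumed to be known to the caller).
 */
RecoveryAction default_scheme_on_wdt_expiry(bool wdt_in_progress, bool pl_present)
{
    if (!wdt_in_progress) {
        /* First expiry: try the least severe recovery. */
        return ACTION_APU_RESTART;
    }
    /* Recovery was already in progress: escalate to a system-level reset. */
    return pl_present ? ACTION_PS_ONLY_RESET : ACTION_SYSTEM_RESET;
}
```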
The following figure shows the flow of the control in case of default escalation scheme.
Healthy Bit Scheme
The default escalation scheme does not guarantee a successful reboot of the system. It only guarantees that ATF has successfully idled the CPUs during recovery.
Consider the scenario in which the FPD_WDT has timed out and an APU subsystem restart is requested, and ATF is able to successfully make the pm_system_shutdown call. The APU subsystem restart is far from finished after pm_system_shutdown is called. The restart process can get stuck elsewhere, such as in FSBL, U-Boot, or the Linux init state. If it does, FPD_WDT expires again, and the same cycle repeats for as long as ATF is loaded and functioning. This cycle can continue indefinitely without the system ever booting back into a clean running state.
The Healthy Bit scheme solves this problem. In addition to the default scheme, the PMU firmware checks a Healthy Bit, which Linux sets after it boots successfully. On WDT expiry, if the Healthy Bit is set, Linux was able to boot into a clean running state and no escalation is needed. If the Healthy Bit is not set, the last restart attempt did not successfully boot into Linux and escalation is needed. There is no point in repeating the same type of restart, so the PMU firmware escalates and calls a system-level reset.
The Healthy Bit scheme is implemented using bit 29 of the PMU global general storage register (PMU_GLOBAL_GLOBAL_GEN_STORAGE0[29]). The PMU firmware clears the bit before starting a recovery or a normal reboot, and Linux must set the bit to flag a healthy boot.
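The bit manipulation involved can be sketched in C as follows. This is a minimal model that operates on a copy of the register value. The helper names are hypothetical, and the way the Healthy Bit check is combined with the default WDT-in-progress check is an assumption based on the description above, not a statement of the firmware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 29 of PMU_GLOBAL_GLOBAL_GEN_STORAGE0, as stated in the text. */
#define HEALTHY_BIT_MASK (UINT32_C(1) << 29) /* 0x20000000 */

/* Hypothetical helpers; they work on a plain value, not on hardware. */
static inline bool healthy_bit_is_set(uint32_t ggs0)
{
    return (ggs0 & HEALTHY_BIT_MASK) != 0u;
}

static inline uint32_t healthy_bit_clear(uint32_t ggs0)
{
    return ggs0 & ~HEALTHY_BIT_MASK; /* done before recovery or reboot */
}

static inline uint32_t healthy_bit_set(uint32_t ggs0)
{
    return ggs0 | HEALTHY_BIT_MASK; /* done by Linux after a healthy boot */
}

/*
 * Assumed combination of the two checks: escalate if a recovery is still
 * in progress (default scheme) or if the last boot never set the bit.
 */
bool healthy_scheme_needs_escalation(bool wdt_in_progress, uint32_t ggs0)
{
    return wdt_in_progress || !healthy_bit_is_set(ggs0);
}
```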
PMU global registers are accessed from Linux through the sysfs interface. To set the healthy bit from Linux, execute the following command (or perform the equivalent write from your code):
# echo "0x20000000 0x20000000" > "/sys/devices/platform/firmware/ggs0"
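If you prefer to set the bit from application code rather than from the shell, the same write can be done in C. The following is a minimal sketch: the function name write_healthy_bit is hypothetical, the path and the string written are copied verbatim from the command above, and the path is passed as a parameter so that the function can be exercised against an ordinary file.

```c
#include <stdio.h>

/* Path and value string are taken from the shell command above. */
#define GGS0_SYSFS_PATH   "/sys/devices/platform/firmware/ggs0"
#define HEALTHY_BIT_WRITE "0x20000000 0x20000000"

/*
 * Write the healthy-bit string, followed by a newline as echo does,
 * to the file at 'path'. Returns 0 on success and -1 on failure.
 */
int write_healthy_bit(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    int ok = (fputs(HEALTHY_BIT_WRITE "\n", f) >= 0);
    if (fclose(f) != 0) {
        ok = 0; /* a failed flush on close also counts as a failed write */
    }
    return ok ? 0 : -1;
}
```

On the target, you would call write_healthy_bit(GGS0_SYSFS_PATH) once Linux considers the boot successful. Like the shell command, this is expected to require sufficient privileges to write the sysfs entry.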
To enable the healthy bit based escalation scheme, build the PMU firmware with the following flag:
CHECK_HEALTHY_BOOT
The following figure shows the flow of the control in case of the healthy bit escalation scheme.
Customizing Recovery and Escalation Scheme
By default, when FPD_WDT times out, the PMU firmware does not invoke any type of restart. Xilinx provides predefined RECOVERY and ESCALATION behaviors, and you can customize them to implement a different scheme.
When FPD_WDT times out, the PMU firmware calls FpdSwdtHandler. If ENABLE_EM is defined, FpdSwdtHandler calls XPfw_RecoveryHandler. Otherwise, FpdSwdtHandler is an empty function.
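In xpfw_mod_em.c:
#ifdef ENABLE_EM
/* With error management enabled, log the FPD WDT error and start recovery. */
void FpdSwdtHandler(u8 ErrorId)
{
    XPfw_Printf(DEBUG_ERROR, "EM: FPD Watchdog Timer Error (Error ID: %d)\r\n", ErrorId);
    XPfw_RecoveryHandler(ErrorId);
}
#else
/* Without ENABLE_EM, the handler is an empty stub. */
void FpdSwdtHandler(u8 ErrorId) { }
#endif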
Without ENABLE_EM, you can simply update FpdSwdtHandler, which is called on FPD_WDT timeout. With ENABLE_EM turned on, you need to update XPfw_RecoveryHandler instead.
Similarly, turning on RECOVERY defines XPfw_RecoveryHandler (see xpfw_restart.c). Unless RECOVERY is turned on, XPfw_RecoveryHandler is an empty function and nothing happens when FPD_WDT times out.
XPfw_RecoveryHandler follows the flowchart shown in the Escalation Scheme section. When FPD_WDT times out, the code follows the progression of orange boxes: if WDT is not already in progress, it restarts the WDT, sets the WDT_In_Progress flag, and raises the TTC (timer 9) interrupt to ATF. ATF then takes over. It raises a software interrupt for all active cores, clears pending interrupts, and so on (see blue boxes). Essentially, the PMU restarts and boosts the WDT, then sends a request to ATF. ATF cleanly idles all four APU cores, and when they have all reached WFI (Last Active Core is true), ATF issues a PMU System Shutdown back to the PMU with the APU subsystem as argument. When the PMU receives this command, it invokes the APU subsystem restart.
If ENABLE_ESCALATION is not set, the code never takes the Do Escalation path. If XPfw_RecoveryHandler hangs for some reason (for example, something went wrong and the APU cannot put all four CPU cores into WFI), the system keeps retrying the APU restart or hangs forever. When ENABLE_ESCALATION is on and anything goes wrong during execution of the flowchart, the WDT still appears to be in progress on the next expiry, because the WDT_In_Progress flag is cleared only as the last step. In that case, Do Escalation calls SYSTEM_RESET instead of trying the APU restart again and again.
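The effect of ENABLE_ESCALATION on repeated timeouts can be illustrated with a small state model in C. This is a sketch of the behavior described above, not the code in xpfw_restart.c: the RecoveryModel type and both functions are hypothetical, counters stand in for the real actions, and clearing the flag after a system reset is an assumption that the reset starts from a clean state.

```c
#include <stdbool.h>

/* Hypothetical model of the recovery state; counters replace real actions. */
typedef struct {
    bool escalation_enabled;   /* models building with ENABLE_ESCALATION */
    bool wdt_in_progress;      /* models the WDT_In_Progress flag */
    int  apu_restart_requests; /* times an APU restart was requested via ATF */
    int  system_resets;        /* times Do Escalation called SYSTEM_RESET */
} RecoveryModel;

/* Called on every FPD_WDT timeout. */
void model_on_fpd_wdt_timeout(RecoveryModel *m)
{
    if (m->wdt_in_progress && m->escalation_enabled) {
        /* A previous recovery never finished: take the Do Escalation path. */
        m->system_resets++;
        m->wdt_in_progress = false; /* assumption: reset gives a clean state */
        return;
    }
    /* Restart the WDT, set the flag, and hand the request over to ATF. */
    m->wdt_in_progress = true;
    m->apu_restart_requests++;
}

/* Called when the APU subsystem restart completes (last flowchart step). */
void model_on_apu_restart_done(RecoveryModel *m)
{
    m->wdt_in_progress = false;
}
```

In this model, a recovery that never completes leads to repeated APU restart requests when escalation is disabled, and to a single system reset on the second timeout when it is enabled.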
To customize recovery and escalation behavior, use the provided XPfw_RecoveryHandler as a template for your own XPfw_RecoveryHandler function.
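As an example of a customized scheme, the three-step escalation mentioned at the start of this section (APU restart, then PS-only reset, then system reset) can be expressed as a stage selection function. This is a minimal sketch under stated assumptions: EscalationStage and custom_next_stage are hypothetical names, and the count of consecutive unrecovered expirations is assumed to be kept by your handler in storage that survives the resets involved, which this sketch does not address.

```c
/* Hypothetical stages of a custom three-step escalation scheme. */
typedef enum {
    STAGE_APU_RESTART = 0, /* first watchdog expiration */
    STAGE_PS_ONLY_RESET,   /* second consecutive expiration */
    STAGE_SYSTEM_RESET     /* third and any later expiration */
} EscalationStage;

/*
 * Select the recovery stage from the number of earlier watchdog
 * expirations that were not followed by a successful recovery.
 */
EscalationStage custom_next_stage(unsigned int unrecovered_expiries)
{
    if (unrecovered_expiries == 0u) {
        return STAGE_APU_RESTART;
    }
    if (unrecovered_expiries == 1u) {
        return STAGE_PS_ONLY_RESET;
    }
    return STAGE_SYSTEM_RESET;
}
```

A customized XPfw_RecoveryHandler could call such a function on each timeout and reset the count once the system reports a healthy boot.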