[cgl_discussion] Use case - Boot cycle detection
cherry at osdl.org
Wed Apr 13 09:14:01 PDT 2005
On Wed, 2005-04-13 at 21:22 +0900, Takashi Ikebe wrote:
> The following is a use case for boot cycle detection. It
> addresses SMM6.0 Boot Cycle Detection in CGL Specification 3.0.
> Please feel free to comment or make suggestions.
> OSDL CGL specifies that carrier grade Linux shall provide support for
> detecting a repeating reboot cycle due to recurring failures. This detection should
> happen in user space before system services are started. This type of
> failure requires a response due to the negative impact of repeatedly
> taking down services. A configurable policy is needed to set thresholds
> of cycling and desired shutdown actions, such as exponential back off,
> shutdown, or notifying administrators.
> Desired Outcome
> Mainline acceptance and distro acceptance.
> System administrators use the function during server operation.
> System administrators activate the function during setup.
> During operation, system administrators generally monitor system
> health from a remote operation center. The function lets them learn of a
> reboot cycle caused by recurring failures, because it shuts down the
> system or notifies the operator over the network.
> Implementation Notes
> The function should provide at least the following:
> 1. A counter of recurring reboots.
> 2. A function which resets the counter.
> 3. A function which powers off the machine.
> The function increments the counter on each boot, and if the system
> boots up normally, the function resets the counter. If the counter
> exceeds the threshold, the function shuts down the system.
Takashi, at what point do you consider the node to boot normally? After
all the applications have been started up and are responding to some
kind of health monitoring?
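To make the question concrete, here is a rough sketch of the counter
logic described in the quoted notes. It is only an illustration, not a
proposed implementation: the class name, the method names, the state
file and the default threshold are all assumptions made up for this
example. The open point is exactly where mark_boot_ok() would be called.

```python
import os


class BootCycleDetector:
    """Hypothetical sketch: persistent boot counter with a threshold."""

    def __init__(self, state_path, threshold=3):
        # state_path and threshold are assumed policy settings.
        self.state_path = state_path
        self.threshold = threshold

    def _read(self):
        # A missing or corrupt state file is treated as "no failed boots".
        try:
            with open(self.state_path) as f:
                return int(f.read().strip() or 0)
        except (FileNotFoundError, ValueError):
            return 0

    def _write(self, count):
        # Write to a temporary file and rename it, so that a crash in the
        # middle of an update does not leave a half-written counter.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(count))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.state_path)

    def record_boot(self):
        """Increment the counter; meant to run early on every boot."""
        count = self._read() + 1
        self._write(count)
        return count

    def mark_boot_ok(self):
        """Reset the counter once the boot is considered normal."""
        self._write(0)

    def decide(self, count):
        """Return "shutdown" if the counter exceeds the threshold."""
        return "shutdown" if count > self.threshold else "continue"
```

An early boot script would call record_boot() and act on decide();
whatever declares the node healthy would call mark_boot_ok(). The
sketch deliberately returns a decision instead of powering off, so the
action (shutdown, back off, notify) stays a matter of policy.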
> System administrators can learn of the system error from the machine shutting down.
Is the only remedy to shut down the machine? As Tim mentioned, does
this tie in to boot image fallback?
> In addition to the above functions, the following function may increase
> system serviceability:
> A function which reports the reboot status to a remote operation node via
> the Resumo project: http://resumo.sourceforge.net/
> (It does not have this function yet, but it will be available soon.)