[cgl_discussion] Use case - Boot cycle detection

Chen, Terence terence.chen at intel.com
Thu Apr 14 13:43:22 PDT 2005


>>>Implementation Notes
>>>The function should provide at least the following:
>>>1. A counter of recurring reboots.
>>>2. A function which resets the counter.
>>>3. A function which powers off the machine.
>>>The counter is incremented on each boot, and if the system boots up
>>>normally, the counter is reset. If the counter exceeds the threshold,
>>>the system is shut down.
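
A minimal sketch of the three pieces listed above (counter, reset,
power-off decision), written in Python for illustration only. The class
name BootCounter, the counter file path, and the threshold value are all
assumptions and do not come from the original notes. The sketch only
reports that the threshold was exceeded; the actual power-off call is
left to the caller.

```python
import os


class BootCounter:
    """Persistent count of boots since the last successful reset.

    Hypothetical helper: the file location and threshold are assumptions.
    """

    def __init__(self, path, threshold):
        self.path = path
        self.threshold = threshold

    def read(self):
        # A missing or unreadable counter file is treated as zero.
        try:
            with open(self.path) as f:
                return int(f.read().strip() or 0)
        except (FileNotFoundError, ValueError):
            return 0

    def _write(self, value):
        # Write to a temporary file and rename it, so a crash during the
        # write does not leave a half-written counter behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def on_boot(self):
        # Increment on every boot; True means the threshold was exceeded
        # and the caller should power off the machine.
        count = self.read() + 1
        self._write(count)
        return count > self.threshold

    def reset(self):
        # Called once the system is considered to have booted normally.
        self._write(0)
```

An early boot script would call on_boot() and power the machine off when
it returns True; a later script would call reset() at whatever point the
system policy defines as a normal boot.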
>>
>>
>> Takashi, at what point do you consider the node to boot normally?
>> After all the applications have been started up and are responding to
>> some kind of health monitoring?
>
>Well, it depends on system policy. If the system wants to detect only
>service startup, then the reset function will run as S99something,
>and if the system wants to detect health over the whole system lifetime
>(boot, operate, shutdown...), then the reset function should run as
>K99something in the init script. (With this pattern, you can detect a
>sudden kernel panic or serious application error!)
>
[Chen, Terence] A policy-based detection engine can be tailored for
different deployments. As an example usage scenario, consider a node at
a remote location that is configured to reboot automatically if the
kernel hangs (with the watchdog timer enabled). A boot-cycle detection
policy for this node might be configured to detect and prevent the node
from rebooting X times in a Y-minute interval, since the policy engine
can automatically deem such a situation a node failure.
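
One way to picture the "X times in Y minutes" policy is a sliding window
over boot timestamps. The sketch below is only an illustration under
stated assumptions: the class name BootCyclePolicy and its parameters are
hypothetical, and the timestamps are kept in memory, whereas a real node
would have to persist them across reboots.

```python
from collections import deque


class BootCyclePolicy:
    """Flags a node failure after X boots inside a Y-minute window.

    Hypothetical sketch: names and parameters are assumptions.
    """

    def __init__(self, max_reboots, window_minutes):
        self.max_reboots = max_reboots              # X
        self.window_seconds = window_minutes * 60   # Y, in seconds
        self.boot_times = deque()

    def record_boot(self, now):
        # 'now' is the boot time in seconds (for example, epoch time).
        self.boot_times.append(now)
        # Drop boots that fall outside the window ending at 'now'.
        cutoff = now - self.window_seconds
        while self.boot_times and self.boot_times[0] < cutoff:
            self.boot_times.popleft()
        # True means the policy deems this a node failure.
        return len(self.boot_times) >= self.max_reboots
```

With max_reboots=3 and window_minutes=10, boots at 0, 100 and 200 seconds
trip the policy on the third boot, while boots at 0, 100 and 1000 seconds
do not, because the first two have aged out of the window by then.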
   
>>>System administrators can know the system error via machine shutting
>>>down.
>> Is the only remedy to shut down the machine?  As Tim mentioned, does
>> this tie in to boot image fallback?
>
>Sure! Working with boot image fallback makes the system more available.
>Add the relationship in implementation Notes;
>



