[cgl_discussion] Use case - Boot cycle detection

Takashi Ikebe ikebe.takashi at lab.ntt.co.jp
Thu Apr 14 20:11:57 PDT 2005


I updated the scenario as below:

Scenarios
---------------
System administrators activate the function during setup.
During operation, system administrators generally monitor system
health from a remote operation center. The function detects a reboot
cycle caused by recurring failures and responds by shutting down the
system or notifying the operator via the network if the reboot counter
exceeds its threshold.
If the system needs to detect only service startup failures, the reboot
counter should be reset by an S99 init script. If the system needs to
monitor health over the whole system lifetime (boot, operate, shutdown),
the reboot counter should be reset by a K99 init script.
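
The counter handling described above could be sketched roughly as
follows. This is only a minimal illustration in Python, not a proposed
implementation: the state file path, the function names (on_boot,
reset_counter, read_count, write_count) and the on_exceeded callback are
all hypothetical. A real system would hook the callback to a shutdown or
an operator notification.

```python
import os


def read_count(path):
    """Return the stored boot count, or 0 if the file is missing or invalid."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def write_count(path, count):
    """Persist the count via a temp file so a crash cannot leave a partial write."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def on_boot(path, threshold, on_exceeded):
    """Increment the counter at boot; call on_exceeded if it exceeds threshold."""
    count = read_count(path) + 1
    write_count(path, count)
    if count > threshold:
        # Hypothetical hook: shut down the system or notify the operator.
        on_exceeded(count)
    return count


def reset_counter(path):
    """Reset the counter once the system is considered healthy."""
    write_count(path, 0)
```

Under these assumptions, on_boot would be called early in the boot
sequence, and reset_counter from the S99 script (service startup
detection) or the K99 script (whole-lifetime detection).
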
If a node at a remote location is configured to reboot automatically
when the kernel hangs (with the watchdog timer enabled), a boot-cycle
detection policy for this node might be configured to detect and prevent
the node from rebooting X times within a Y-minute interval, since the
policy engine can automatically deem such a situation a node failure.
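
One possible reading of the "X times in Y minutes" policy is a sliding
window over boot timestamps. The sketch below is an assumption on my
side, not an agreed design: the class name BootCyclePolicy and its
parameters are hypothetical, it treats reaching X boots inside the
window as a node failure, and it keeps the timestamps in memory, so it
assumes the policy engine either runs off the node or persists this
state across reboots.

```python
from collections import deque


class BootCyclePolicy:
    """Flag a node as failed when it boots max_reboots times within a window."""

    def __init__(self, max_reboots, window_seconds):
        self.max_reboots = max_reboots        # "X" in the scenario
        self.window_seconds = window_seconds  # "Y minutes", given in seconds
        self.boots = deque()                  # boot timestamps, oldest first

    def record_boot(self, timestamp):
        """Record a boot (seconds) and return True if the policy is violated."""
        self.boots.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop boots that have fallen out of the window.
        while self.boots and self.boots[0] <= cutoff:
            self.boots.popleft()
        return len(self.boots) >= self.max_reboots
```

For example, with BootCyclePolicy(3, 600), boots at 0, 100 and 200
seconds would trip the policy on the third boot, while boots at 0, 300
and 700 seconds would not, because the first boot has left the window
by then.
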


Chen, Terence wrote:
>>>>Implementation Notes
>>>>The function should provide at least the following:
>>>>1. A counter of recurring reboots.
>>>>2. A function which resets the counter.
>>>>3. A function which powers off the machine.
>>>>The function increments the counter at each boot, and if the system
>>>>boots up normally, it resets the counter. If the counter exceeds the
>>>>threshold, the function shuts down the system.
>>>
>>>
>>>Takashi, at what point do you consider the node to boot normally? After
>>>all the applications have been started up and are responding to some
>>>kind of health monitoring?
>>
>>Well, it depends on system policy. If the system wants to detect only
>>service startup, then the reset function will run as S99something,
>>and if the system wants to detect whole system lifetime health (boot,
>>operate, shutdown..), then the reset function should run as K99something
>>in the init scripts. (With this pattern, you can detect a sudden kernel
>>panic or a serious application error!)
>>
> 
> [Chen, Terence] A policy-based detection engine can be tailored for
> different deployments. As an example usage scenario: a node located
> at a remote location is configured to reboot automatically if the kernel
> hangs (with the watchdog timer enabled); a boot-cycle detection policy
> for this node might be configured to detect and prevent this node from
> rebooting X times in a Y-minute interval, as the policy engine can
> automatically deem such a situation a node failure.
>    
> 
>>>>System administrators can learn of the system error from the machine
>>>>shutting down.
>>
>>>Is the only remedy to shut down the machine?  As Tim mentioned, does
>>>this tie in to boot image fallback?
>>
>>Sure! Working with boot image fallback makes the system more available.
>>Add the relationship in implementation Notes;
>>


-- 
Takashi Ikebe
NTT Network Service Systems Laboratories
9-11, Midori-Cho 3-Chome Musashino-Shi,
Tokyo 180-8585 Japan
Tel : +81 422 59 4246, Fax : +81 422 60 4012
e-mail : ikebe.takashi at lab.ntt.co.jp


