[cgl_discussion] Use case - Boot cycle detection

Takashi Ikebe ikebe.takashi at lab.ntt.co.jp
Thu Apr 14 06:47:19 PDT 2005


John Cherry wrote:

>>Scenarios
>>System administrators activate the function during setup.
>>During operation, system administrators generally monitor system
>>health from a remote operation center. The function detects a
>>reboot cycle caused by recurring failures and responds by shutting
>>down the system or notifying the operator over the network.
>>
>>Implementation Notes
>>The function should provide at least the following:
>>1. A counter of recurring reboots.
>>2. A function that resets the counter.
>>3. A function that powers off the machine.
>>The counter is incremented at each boot, and if the system
>>boots up normally, the counter is reset. If the counter
>>exceeds the threshold, the function shuts down the system.
> 
> 
> Takashi, at what point do you consider the node to boot normally?  After
> all the applications have been started up and are responding to some
> kind of health monitoring?

Well, it depends on system policy. If the system only needs to detect
service startup failures, the reset function should run as S99something.
If the system needs to cover its whole lifetime (boot, operation,
shutdown...), the reset function should run as K99something in the
init scripts. (With this pattern, you can also detect a sudden kernel
panic or a serious application error!)
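
To make the difference between the two policies concrete, here is a
small simulation sketch. It is only an illustration: run_cycles, the
outcome names ("clean", "panic", "boot_fail") and the policy names
("startup", "shutdown") are hypothetical, not part of any existing
tool. "startup" stands for resetting from an S99-style script, and
"shutdown" for resetting from a K99-style script.

```python
def run_cycles(outcomes, reset_at):
    """Return the counter value left behind after each boot cycle.

    outcomes: one entry per boot cycle (hypothetical labels):
        "clean"     - services started and the system shut down cleanly
        "panic"     - services started, then the system died abruptly
        "boot_fail" - the system never finished starting its services
    reset_at: "startup" (S99-style reset) or "shutdown" (K99-style reset)
    """
    counter = 0
    history = []
    for outcome in outcomes:
        counter += 1  # incremented early in every boot
        if outcome == "boot_fail":
            pass  # neither reset point is ever reached
        elif reset_at == "startup":
            counter = 0  # reset as soon as services are up
        elif reset_at == "shutdown" and outcome == "clean":
            counter = 0  # reset only on an orderly shutdown
        history.append(counter)
    return history


cycles = ["clean", "panic", "panic", "boot_fail"]
print(run_cycles(cycles, "startup"))   # [0, 0, 0, 1]
print(run_cycles(cycles, "shutdown"))  # [0, 1, 2, 3]
```

With the startup policy the two panics go unnoticed, because the
counter was already reset before they happened. With the shutdown
policy every cycle that does not end in an orderly shutdown leaves the
counter incremented.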

>>System administrators can learn of the system error from the machine
>>shutting down.
> Is the only remedy to shut down the machine?  As Tim mentioned, does
> this tie in to boot image fallback?

Sure! Working together with boot image fallback makes the system more
available. I have added the relationship to the Implementation Notes:

Implementation Notes
-----------------------------
The function should provide at least the following:
- 1. A counter of recurring reboots.
- 2. A function that resets the counter.
- 3. A function that powers off the machine.
The counter is incremented at each boot, and if the system
boots up normally, the counter is reset. If the counter
exceeds the threshold, the function shuts down the system.
The requirement may further enhance system availability when it is
implemented together with the AVL9.0 Boot Image Fallback function
described in CGL Specification 3.0.
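
The three pieces listed above could be sketched roughly as follows.
This is a minimal sketch under assumptions, not a reference
implementation: the BootCounter class, its method names, and the idea
of keeping the counter in a plain file are all hypothetical. The actual
power-off (or hand-over to boot image fallback) is left to the caller,
since how that is done depends on the platform.

```python
import os
import tempfile


class BootCounter:
    """Persistent count of consecutive boots that were never reset."""

    def __init__(self, path, threshold):
        self.path = path            # hypothetical location of the counter file
        self.threshold = threshold  # policy value chosen by the administrator

    def read(self):
        # A missing or unreadable counter is treated as zero.
        try:
            with open(self.path) as f:
                return int(f.read().strip() or 0)
        except (FileNotFoundError, ValueError):
            return 0

    def _write(self, value):
        # Write to a temporary file and rename it, so that a crash in the
        # middle of the update does not leave a half-written counter.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def increment_on_boot(self):
        # Item 1: called once, early in every boot.
        count = self.read() + 1
        self._write(count)
        return count

    def reset(self):
        # Item 2: called at the point the policy defines as "normal".
        self._write(0)

    def should_power_off(self):
        # Item 3 (decision only): true once the counter exceeds the threshold.
        return self.read() > self.threshold


if __name__ == "__main__":
    counter_file = os.path.join(tempfile.mkdtemp(), "bootcount")
    bc = BootCounter(counter_file, threshold=3)
    for _ in range(4):  # four boots in a row without a reset
        bc.increment_on_boot()
    print(bc.should_power_off())  # True
    bc.reset()
    print(bc.should_power_off())  # False
```

Where should_power_off() returns True, the caller could either shut
the machine down or switch to the fallback boot image, depending on
policy.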

-- 
Takashi Ikebe
NTT Network Service Systems Laboratories
9-11, Midori-Cho 3-Chome Musashino-Shi,
Tokyo 180-8585 Japan
Tel : +81 422 59 4246, Fax : +81 422 60 4012
e-mail : ikebe.takashi at lab.ntt.co.jp


