[cgl_discussion] Use case - Boot image fallback

Takashi Ikebe ikebe.takashi at lab.ntt.co.jp
Wed Apr 13 04:46:32 PDT 2005

The following is a use case for Boot Image Fallback. It
addresses AVL9.0 Boot Image Fallback in CGL Specification 3.0.
Please feel free to send comments or suggestions.

OSDL CGL specifies that carrier grade Linux shall provide a mechanism that enables a system to fallback to a previous "known good" boot image in the event of a catastrophic boot failure (i.e. failure to boot, panic on boot, failure to initialize HW/SW). System images are captured from the "known good" system and the system reboots to the latest good image. This mechanism would allow an automatic fallback mechanism to protect against problems resulting from system changes, such as program updates, installations, kernel changes, and configuration changes.

Desired Outcome
Mainline acceptance and distro acceptance.

System administrators set up the requirement at installation.

The scenario is based on the following environment:
/dev/hda1 for /boot 
/dev/hda2 for root partition (service partition)
/dev/hdb1 for 1st backup partition 
/dev/hdc1 for 2nd backup partition 
/dev/hdd1 for 3rd backup partition
The system has a watchdog timer (WDT).

1. After installation, the system administrator backs up the system partition to the 1st, 2nd, and 3rd backup partitions.
2. The system operates for a while.
3. An update is provided by application developers or distributors, and the system administrator updates the system partition.
4. After the update, if the system or important applications do not come up (depending on your configuration), the WDT resets the system.
5. After the reset, the requirement detects that an abnormal system reboot has occurred and boots the system from the 2nd backup partition.
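
One possible way to detect the abnormal reboot in step 5 is a persistent "boot pending" flag. This is only a sketch under assumptions: the names BootFlag, loader_start and confirm_boot are hypothetical, and the flag is assumed to live somewhere that survives a WDT reset (for example NVRAM or a file on /boot). Here it is modelled in memory.

```python
from dataclasses import dataclass


@dataclass
class BootFlag:
    """Hypothetical persistent flag; a real one must survive a WDT reset."""
    pending: bool = False


def loader_start(flag: BootFlag) -> bool:
    """Run by the boot loader. Returns True if the previous boot was abnormal."""
    abnormal = flag.pending  # still set: the last boot never confirmed success
    flag.pending = True      # arm the flag for the boot that is about to start
    return abnormal


def confirm_boot(flag: BootFlag) -> None:
    """Run once the system and the important applications have come up."""
    flag.pending = False


flag = BootFlag()
print(loader_start(flag))  # first boot after installation
confirm_boot(flag)         # system came up normally
print(loader_start(flag))  # boot after the update
# No confirm_boot() here: the update failed and the WDT reset the system.
print(loader_start(flag))  # the loader now sees the flag still set
```

The point at which confirm_boot runs is what "depends on your configuration" in step 4 would control: after the kernel is up, or only after the important applications have started.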

Implementation Notes
To enable fallback of the kernel itself, the requirement should be implemented at a lower level than the kernel, such as in the boot loader.
The requirement should have the following functions:
A function that selects the next boot choice if the requirement detects an abnormal reboot.
A function that records the current fallback status.
A function that resets the current fallback status.
To improve system serviceability and availability, the following functions should also be considered:
A function that reports an abnormal reboot to the management system via the network.
A function that powers off the system after the abnormal reboot count exceeds a preset value.
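
The select, record, reset and power-off functions listed above could be sketched as follows. This is an illustration, not the Resumo implementation: FallbackStatus, select_next, record, load, reset and the max_abnormal parameter are hypothetical names, the status is serialized as JSON only for readability, and trying the partitions in list order is an assumed policy. Network reporting is left out.

```python
import json
from dataclasses import asdict, dataclass

# Boot choices taken from the scenario environment; the order is an assumption.
BOOT_CHOICES = ["/dev/hda2", "/dev/hdb1", "/dev/hdc1", "/dev/hdd1"]


@dataclass
class FallbackStatus:
    choice: int = 0            # index into BOOT_CHOICES
    abnormal_reboots: int = 0  # abnormal reboots since the last reset


def select_next(status, max_abnormal=3):
    """Called on an abnormal reboot. Returns a device, or None to power off."""
    status.abnormal_reboots += 1
    if status.abnormal_reboots > max_abnormal:
        return None  # preset value exceeded: the caller should power off
    status.choice = min(status.choice + 1, len(BOOT_CHOICES) - 1)
    return BOOT_CHOICES[status.choice]


def record(status):
    """Serialize the current fallback status for persistent storage."""
    return json.dumps(asdict(status))


def load(text):
    """Restore a fallback status previously produced by record()."""
    return FallbackStatus(**json.loads(text))


def reset(status):
    """Return to the service partition and clear the abnormal reboot count."""
    status.choice = 0
    status.abnormal_reboots = 0


status = FallbackStatus()
print(select_next(status))  # device chosen after the first abnormal reboot
print(record(status))       # what would be written to persistent storage
```

A real boot loader would keep this status in a small fixed-format area that it can read and write before any file system driver is available.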

Resumo project: http://resumo.sourceforge.net/

Takashi Ikebe
NTT Network Service Systems Laboratories
9-11, Midori-Cho 3-Chome Musashino-Shi,
Tokyo 180-8585 Japan
Tel : +81 422 59 4246, Fax : +81 422 60 4012
e-mail : ikebe.takashi at lab.ntt.co.jp
