[cgl_discussion] Project Review: Boot Cycle Detection

Gustafson, Geoffrey R geoffrey.r.gustafson at intel.com
Wed Sep 11 12:00:58 PDT 2002


Project Review for the Hardened Drivers Project
-----------------------------------------------

1. Quote the requirements from the requirements doc
   that your project is expected to meet.

Requirement: 2.3 Boot Cycle Detection
Version Assignment:  Configurable 1.1
·	The goal of OSDL CGL is to utilize a standard API such that
applications can be guaranteed to work across distributions.  As no such
standard yet exists we don't yet consider this to be a core function.

Application Type:  G, S, M
Description: 	OSDL CGL shall detect a frequent reboot cycle due to
recurring failures and will go offline if this occurs.

2. Explain how you think the project meets the
   above requirements.

   We clarified on the requirements mailing lists that we intend this to be
   a solution for disk-based machines, with detection occurring after the
   kernel comes up. For network boot configurations, this feature can be
   provided by middleware on the boot server. A boot cycle detection feature
   provided at a lower level, e.g. in the boot loader, would be i) more
   complex and ii) more platform-specific, as it requires some type of
   nonvolatile storage. As no CGL member had an implementation of such a
   thing, or a burning desire to create one, this was infeasible for v1.0.

   This implementation allows you to configure the maximum number of
   consecutive reboots considered acceptable, and the minimum uptime the
   system must experience before the reboot count is reset. If the maximum
   number of reboots is exceeded, the system shuts down.

   Thus the vague terms "frequent" and "recurring failures" are defined by
   the end-user through configuration, as appropriate for the particular
   application set. The minimum uptime essentially defines "frequent", and
   the maximum reboots value defines the number of "recurring failures".

3. Explain the design of the project or point to
   a document on the web that explains the design.

   bootcycle is a simple implementation via bash shell scripts.

   The feature is configured through the file /etc/bootcycle.conf.
   It records a boot counter in /var/lib/misc/bootcycle.status.
   It logs messages to /var/log/bootcycle.log (as it runs before either
   syslog or the POSIX event log is available, this is the best option).
   A man page bootcycle(8) is provided with more information.

   Design summary:
   The bootcycle utility should be executed after each boot before the init
   scripts. If enabled, it checks the previous value of a boot counter. If
   the counter exceeds the configured maximum number of consecutive reboots,
   bootcycle resets the counter and shuts down the system. Otherwise, it
   increments the counter and sleeps until the configured minimum uptime
   has elapsed, then resets the counter to zero. Thus the counter only
   remains incremented if the system reboots within that time.
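
   The counter logic in the design summary can be sketched roughly as
   follows. This is a minimal illustration, not the actual bootcycle
   script: the function names bootcycle_step and bootcycle_settle, and the
   rule of treating a missing or malformed status file as a count of zero,
   are assumptions made for the example.

```bash
#!/bin/bash
# Hypothetical sketch of the boot counter logic; the function names are
# illustrative and are not taken from the real bootcycle scripts.

# bootcycle_step STATUS_FILE MAX_REBOOTS
# Reads the previous counter. If it exceeds MAX_REBOOTS, resets the counter
# and prints "shutdown"; otherwise increments it and prints "continue".
bootcycle_step() {
    local status_file=$1 max_reboots=$2 count=0

    # Assumption: a missing status file means no previous quick reboots.
    if [ -r "$status_file" ]; then
        count=$(cat "$status_file")
    fi
    case $count in
        ''|*[!0-9]*) count=0 ;;      # anything that is not a plain number
    esac
    count=$((10#$count))             # force base 10 (avoids octal parsing)

    if [ "$count" -gt "$max_reboots" ]; then
        echo 0 > "$status_file"      # reset before going offline
        echo shutdown
    else
        echo $((count + 1)) > "$status_file"
        echo continue
    fi
}

# bootcycle_settle STATUS_FILE
# To be called once the minimum uptime has elapsed: the boot was healthy,
# so the counter goes back to zero.
bootcycle_settle() {
    echo 0 > "$1"
}
```

   A caller would shut the system down when bootcycle_step prints
   "shutdown", and otherwise sleep for the configured minimum uptime before
   calling bootcycle_settle. Whether the real script sleeps in the
   background so that the init scripts can proceed is not stated here.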

   This program is primarily useful in a complex network environment with
   many servers being managed. It is easier to identify that a machine is in
   need of service if it fails entirely. If it is stuck in a reboot cycle,
   administrators may assume that it is coming back online. When there is
   load sharing among other servers, it is better for the failed machine to
   shut down than to begin handling traffic only to drop it repeatedly.

   The bootcycle.conf file allows you to enable/disable the feature, set the
   maximum reboots, and set the minimum uptime (in minutes).
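
   The syntax of bootcycle.conf is not shown in this review, so the
   following is only a guess at how such a file might be read. It assumes a
   simple KEY=value layout. The key names ENABLED, MAX_REBOOTS and
   MIN_UPTIME, the BC_ variable names, and the default values are all
   hypothetical; bootcycle(8) describes the real format.

```bash
#!/bin/bash
# Hypothetical loader for a KEY=value style config file. The key names,
# variable names and defaults are invented for illustration and may not
# match the real /etc/bootcycle.conf.

# load_bootcycle_conf FILE
# Sets BC_ENABLED, BC_MAX_REBOOTS, BC_MIN_UPTIME (minutes) and
# BC_MIN_UPTIME_SECS. Returns 1 if FILE is unreadable, 2 on a bad number.
load_bootcycle_conf() {
    local conf=$1 key value

    # Assumed defaults: feature off until explicitly enabled.
    BC_ENABLED=no
    BC_MAX_REBOOTS=5
    BC_MIN_UPTIME=10

    [ -r "$conf" ] || return 1

    # Parse line by line instead of sourcing the file, so that the config
    # cannot run arbitrary commands this early in the boot.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            ''|'#'*)     continue ;;              # blank line or comment
            ENABLED)     BC_ENABLED=$value ;;
            MAX_REBOOTS) BC_MAX_REBOOTS=$value ;;
            MIN_UPTIME)  BC_MIN_UPTIME=$value ;;
        esac
    done < "$conf"

    # Both numeric settings must be plain non-negative integers.
    case $BC_MAX_REBOOTS in ''|*[!0-9]*) return 2 ;; esac
    case $BC_MIN_UPTIME  in ''|*[!0-9]*) return 2 ;; esac

    # The minimum uptime is configured in minutes; sleep(1) takes seconds.
    BC_MIN_UPTIME_SECS=$((10#$BC_MIN_UPTIME * 60))
}
```

   Under these assumptions, a file containing MAX_REBOOTS=3 and
   MIN_UPTIME=15 would treat a fourth boot within 15 minutes of the previous
   one as a boot cycle.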

4. Pointer to the code/patch.

   http://cvs.developer.osdl.org/viewcvs/viewcvs.cgi/components/bootcycle/

Geoff


