[cgl_discussion] Project Review: Panic Handler Enhancements

Eric.Chacron at alcatel.fr Eric.Chacron at alcatel.fr
Thu Oct 10 08:38:00 PDT 2002


I broadly agree with your answer.
As far as I know, IPMI today implements only SNMP traps for network
alerts.  The panic handler cannot send an alert message itself, because it
has to run with interrupts masked, whereas the BMC processor can.
So whether a protocol other than SNMP can be used depends on the IPMI
specification, not on the panic handler spec.

Eric




"Vincent, Perry G" <perry.g.vincent at intel.com> on 10/10/2002 15:24:30

To:   Eric CHACRON/FR/ALCATEL at ALCATEL, "Cress, Andrew R"
      <andrew.r.cress at intel.com>
cc:   cgl_discussion at osdl.org
Subject:  RE: [cgl_discussion] Project Review: Panic Handler Enhancements


I think it depends on the receiver's technology.  Andy mentioned a
NOC-resident receiver/listener, but the trap host target could be a target
local to the cluster, either another cluster member or an independent
cluster-managing node.

The feature lets the panic handler activate a pre-configured trap that is
transmitted by the system hardware, so at the sender, there is no
SNMP/network stack to negotiate--the panic handler invokes the system
hardware capability directly and the hardware fires the message.

I think the primary benefit is in decreasing the amount of time needed to
detect the failed node -- rather than waiting for a heartbeat delay/failure
to signal a failed member who panic'd, the failed member identifies himself
proactively.
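
As a rough illustration of the detection-time argument above, here is a
minimal sketch comparing worst-case heartbeat detection with a proactive
alert.  The heartbeat interval, the missed-beat threshold, and the alert
delivery time are made-up figures, not measurements from any cluster
package.

```python
def heartbeat_detection_worst_case(interval_s, missed_beats):
    """Worst-case time for peers to declare a node dead via heartbeats.

    The node may die right after sending a beat, so peers wait for
    `missed_beats` further intervals before giving up on it.
    """
    return interval_s * missed_beats


def proactive_detection(alert_delivery_s):
    """Detection time when the failed node announces its own panic."""
    return alert_delivery_s


# Hypothetical figures: 1 s heartbeats, 3 missed beats, 50 ms alert delivery.
hb = heartbeat_detection_worst_case(interval_s=1.0, missed_beats=3)
alert = proactive_detection(alert_delivery_s=0.05)
saved = hb - alert
print(f"heartbeat: {hb:.2f} s, alert: {alert:.2f} s, saved: {saved:.2f} s")
```

With these assumed numbers the alert path saves nearly the whole heartbeat
timeout; the real gain depends on how the cluster tunes its heartbeat.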

Finally, the panic handler should not prescribe the network protocol used
for the alert; however, it should not limit it either.  Perhaps this
requirement should be broadened to indicate that panic handlers for "LAN
alerts on panic" should be written for other protocols also.  There is one
now for SNMP traps, but there could/should be other options for
multi-casting according to [tbd] standard, for WBEM, for ASF, etc. The
hardware/firmware ultimately implements the network access; the panic
handler simply accesses it.

Perry


-----Original Message-----
From: Eric.Chacron at alcatel.fr [mailto:Eric.Chacron at alcatel.fr]
Sent: Thursday, October 10, 2002 8:16 AM
To: Cress, Andrew R
Cc: cgl_discussion at osdl.org
Subject: RE: [cgl_discussion] Project Review: Panic Handler Enhancements




>4) The clustering model could benefit from these enhancements as they
>stand, since the SNMP alert for a panic will be sent.  It may be desirable
>for clustering middleware's fault manager to leverage the SEL & BMC LAN
>messaging, if they are present (probably SA HPI would apply).  If the
>Network Operations Center (receiving the SNMP messages) has some automated
>action capability, it could perform a directed failover via the clustering
>API.  The clustering package would need to support a directed failover via
>either a command-line remote utility, or a remote API.

Here is the problem I see for clustering.  If the SNMP stack is used, the
delay before the trap is received and a "recovery" action starts could be
too long (what is the maximum delay?).  With a lower layer such as MAC
multicast, on the other hand, a message could be broadcast to every node.
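
To make the layer-2 idea concrete, here is a minimal sketch that builds an
Ethernet frame addressed to a multicast MAC.  The group address, the
EtherType 0x88B5 (one of the IEEE "local experimental" values), the source
MAC, and the payload layout are all assumptions for illustration; none of
them is defined by IPMI or by the panic handler.  The code only builds the
bytes; actually sending them would need a raw socket and privileges.

```python
import struct

# Assumptions for illustration only: a locally administered multicast group
# address and the IEEE "local experimental" EtherType 0x88B5.
ALERT_GROUP_MAC = bytes.fromhex("030000000001")
ALERT_ETHERTYPE = 0x88B5
MIN_PAYLOAD = 46  # Ethernet pads payloads shorter than 46 bytes


def is_multicast(mac: bytes) -> bool:
    """A MAC address is multicast when the low bit of its first octet is 1."""
    return bool(mac[0] & 0x01)


def build_alert_frame(src_mac: bytes, node_id: int, reason: bytes) -> bytes:
    """Build an Ethernet frame (without FCS) carrying a tiny panic alert."""
    payload = struct.pack("!H", node_id) + reason
    payload = payload.ljust(MIN_PAYLOAD, b"\x00")
    header = ALERT_GROUP_MAC + src_mac + struct.pack("!H", ALERT_ETHERTYPE)
    return header + payload


frame = build_alert_frame(bytes.fromhex("020000000a01"), node_id=7,
                          reason=b"panic")
print(len(frame), is_multicast(frame[:6]))
```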

---
Eric








"Cress, Andrew R" <andrew.r.cress at intel.com>@lists.osdl.org on 01/10/2002
17:44:42

Sent by:  cgl_discussion-admin at lists.osdl.org


To:   Eric CHACRON/FR/ALCATEL at ALCATEL
cc:   cgl_discussion at osdl.org
Subject:  RE: [cgl_discussion] Project Review: Panic Handler Enhancements


Eric,

Good input.  Thanks.
Let me try to answer these.

1) I chose to use the ASCII text as the data, given the free-form nature of
most panic calls.  Usually the module name is the first part of the panic
string.  Adding typing to all panic calls in the kernel would be
prohibitive.  However, there may be some typing that can be derived (NULL
pointers, invalid interrupt, etc.) and some derived module code that would
be even better.  I'll investigate this further to see if this can be done
for the next cut.
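
A minimal sketch of what such derived typing could look like, assuming the
"module: description" habit holds for the panic string.  The type codes and
patterns are hypothetical; the thread does not define any such table, and
the first sample string is invented.

```python
import re

# Hypothetical type codes; no such table exists in the thread or the kernel.
PANIC_TYPES = [
    (re.compile(r"null pointer", re.I), 0x01),
    (re.compile(r"invalid interrupt|bad irq", re.I), 0x02),
    (re.compile(r"out of memory", re.I), 0x03),
]
UNKNOWN_TYPE = 0x00


def classify_panic(message: str):
    """Return (module, type_code) derived from a free-form panic string.

    The module name is taken to be the text before the first colon, which
    follows the common "module: description" habit but is not guaranteed.
    """
    module, sep, _ = message.partition(":")
    module = module.strip() if sep else ""
    for pattern, code in PANIC_TYPES:
        if pattern.search(message):
            return module, code
    return module, UNKNOWN_TYPE


print(classify_panic("mydrv: NULL pointer in irq handler"))
print(classify_panic("Attempted to kill init!"))
```

Strings that match no pattern keep the free-form text as the only source of
detail, which is why the ASCII data would still be worth storing.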

2) These Panic Handler Enhancements and the dumping method should coexist
without impacting each other, via the panic notifier list.  The dumping
mechanism may have additional hooks to save the snapshot before processing
the notifier list though.

3) The SEL records can be shown or post-processed from Linux via the
showsel utility.  If the system is offline, BMC LAN could be used to view
the SEL.  Via showsel, any new records can be written to the Linux syslog
if desired (and to the POSIX evlog via evlforward).  POSIX Event Log
actions can then be configured by the user/admin.

4) The clustering model could benefit from these enhancements as they
stand, since the SNMP alert for a panic will be sent.  It may be desirable
for clustering middleware's fault manager to leverage the SEL & BMC LAN
messaging, if they are present (probably SA HPI would apply).  If the
Network Operations Center (receiving the SNMP messages) has some automated
action capability, it could perform a directed failover via the clustering
API.  The clustering package would need to support a directed failover via
either a command-line remote utility, or a remote API.  However, I think
the main requirement for clustering should focus on making sure that a
cluster membership change event would be sent via SNMP by the surviving
nodes.

Andy

-----Original Message-----
From: Eric.Chacron at alcatel.fr [mailto:Eric.Chacron at alcatel.fr]
Sent: Tuesday, October 01, 2002 10:13 AM
To: Cress, Andrew R
Cc: cgl_discussion at osdl.org
Subject: RE: [cgl_discussion] Project Review: Panic Handler Enhancements


Andrew,

Some comments on the panic handler enhancements (sorry for delay).

1) The length of the message to be stored in the SEL is limited to 16
bytes.  Would it be possible to store only an error code in the SEL instead
of ASCII text, and to manage a table of panic codes inside the kernel?

2) Page 6, "LKCD": in order to have a crash dump on a diskless node saved
through NFS, do you think the panic handler should be dependent on the
MCORE / LKCD method?

3) Dumping the SEL content (or only the last events) in the crash dump
could also be an added feature.

4) Sending an SNMP message to a pre-configured IP address: I think this
feature, or a similar one based on a lower network layer (MAC multicast,
for instance), could be used by the clustering model to enable a faster
switchover from the failed node to a standby node.  So I suggest trying to
link the "clustering model" with this component (called panic handler
enhancements).

regards,
Eric
---------






"Cress, Andrew R" <andrew.r.cress at intel.com>@lists.osdl.org on 11/09/2002
16:52:26

Sent by:  cgl_discussion-admin at lists.osdl.org


To:   Eric CHACRON/FR/ALCATEL at ALCATEL
cc:   cgl_discussion at osdl.org
Subject:  RE: [cgl_discussion] Project Review: Panic Handler Enhancements


Eric,

RE: crash dump.
Good point, and I didn't include that in the short description.
All Linux panic handling features (bmc_panic, crash dump, and kernel
debugger) need to function smoothly together.  This is included in the
design document (attached), and this bmc_panic kernel feature has been
tested with LKCD, KDB, and KGDB to ensure that these features work
smoothly together, and that a reboot occurs at the specified timeout after
all of the panic features complete.

RE: Alarms Panel
The alarms panel, in this case, is accessed via the IPMB using the IPMI
MasterWriteRead command.  The subroutine that does this would be
implemented differently on non-IPMI platforms.  I don't know of any
standard API that covers the alarms panel, so each non-IPMI platform would
thus have a separate get/set subroutine for the alarms panel LED(s).  In
the current code, if the alarms panel is not detected, these subroutines
are skipped.

RE: SNMP alerting
This BMC firmware capability is defined in the IPMI specification, although
not all systems may implement it.  This feature enables this optimal form
of SNMP notification even if the OS stays down.  It has a configurable alert
destination, and its OID is specified in the IPMI spec (enterprises.3183),
but I still need to add a MIB file for network administrators.  If the
system does not have this capability in firmware, one of the panicsel
utilities (showsel) does add the ability to update the Linux syslog with
firmware log entries (after Linux comes back up), and there are many tools
to send SNMP alerts in Linux, based on patterns from syslog.  Also, if the
OS does not come back up (God forbid :-), remote access directly to the
firmware log (via serial or LAN) provides not only the panic entry, but any
other sensor events that occurred leading up to it.
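
For anyone writing a trap receiver: since "enterprises" is the standard
prefix 1.3.6.1.4.1, the OID above expands to 1.3.6.1.4.1.3183.  The sketch
below shows how that OID would be BER-encoded on the wire.  It is a generic
encoder written for illustration, not code from panicsel, and it only
handles short-form lengths.

```python
def encode_oid(oid: str) -> bytes:
    """BER-encode a dotted OID string (tag 0x06, short-form length)."""
    arcs = [int(part) for part in oid.split(".")]
    # The first two arcs share one octet: 40 * first + second.
    body = bytearray([40 * arcs[0] + arcs[1]])
    for arc in arcs[2:]:
        # Base-128 digits, most significant first, high bit set on all
        # octets except the last.
        chunk = [arc & 0x7F]
        arc >>= 7
        while arc:
            chunk.append((arc & 0x7F) | 0x80)
            arc >>= 7
        body.extend(reversed(chunk))
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(body)]) + bytes(body)


ENTERPRISES = "1.3.6.1.4.1"
print(encode_oid(ENTERPRISES + ".3183").hex())
```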

RE: SA Forum meta-standard API for system management
www.saforum.org does not show this yet, but it is expected to be published
in October.  It should have an API call for each of the defined
sub-functions, so that a library underneath could be built for each
platform, a common one for IPMI, and individual ones for specific
implementations.

Andy

-----Original Message-----
From: Eric.Chacron at alcatel.fr [mailto:Eric.Chacron at alcatel.fr]
Sent: Wednesday, September 11, 2002 8:33 AM
To: Cress, Andrew R
Cc: cgl_discussion at osdl.org
Subject: Re: [cgl_discussion] Project Review: Panic Handler Enhancements


Andrew,

Please let me add some comments on this feature.

Eric
----

>>>4.11 Linux Panic Handler Enhancement
>>>OSDL CGL shall support enriched capabilities on system panic. Currently
>>>the default system panic behavior is to print a short message to the
>>>console and halt the system. OSDL CGL shall provide a set of configurable
>>>functions including log panic event to system event log as well as the
>>>options to reboot, power down, or power cycle when panic event occurs.

I think that this feature should be compatible with crash dump generation.
Today the panic behavior is not only to print and halt the system but also
to call crash dump before rebooting, if that option is chosen.
I think the requirement, as seen by an external user, is the following
ordering:
- 1) user-defined panic handler called
- 2) log in BMC SEL / SNMP notification ...
- 3) crash dump generation
- 4) hardware reset / power cycle / ...
Is there any problem synchronising 3) and 4)?


>>>- write OS Critical Stop message to firmware System Event Log (SEL)
>>>- turn on the Critical Alarm LED on the Telco Alarms Panel

How can this module control the Alarms Panel
(which is architecture and application dependent, I think)?

>>>- send SNMP trap via BMC LAN Alerting mechanism
This supposes the BMC implements an SNMP MIB as an SNMP agent.
Is the format/schema of this MIB generic and specified somewhere?
Why not maintain a sensor indicating the status of the OS on the
machine, which could be polled through IPMB / ICMB?

>>>The Service Availability Forum is working on a
>>>meta-standard API that could be used to group IPMI and other system
>>>management interfaces under one meta-standard.

Is there any link for this meta-standard working group?


______________________________
Eric Chacron
Alcatel - Carrier Network Group
10 rue Latecoere
78140 Velizy France
------------------------------




"Cress, Andrew R" <andrew.r.cress at intel.com>@lists.osdl.org on 10/09/2002
21:29:55

Sent by:  cgl_discussion-admin at lists.osdl.org


To:   cgl_discussion at osdl.org
cc:
Subject:  [cgl_discussion] Project Review: Panic Handler Enhancements



1. Requirements related to the Panic Handler Enhancements project:
---------------------------------------------
4.11 Linux Panic Handler Enhancement
OSDL CGL shall support enriched capabilities on system panic. Currently the
default system panic behavior is to print a short message to the console
and halt the system. OSDL CGL shall provide a set of configurable functions
including log panic event to system event log as well as the options to
reboot, power down, or power cycle when panic event occurs.

2. How Panic Handler Enhancements meet the CGL requirements:
---------------------------------------
Panic Handler Enhancements includes coverage for the most widely adopted
standard for platform firmware APIs, which is IPMI.  Other platforms can
currently be added to this feature by coding other specific handling for
certain subroutines.  Integration of other platforms will be made easier
when a meta-standard API is published to encompass both IPMI and non-IPMI
systems.  See design information below for functional descriptions.

3. Project design information:
-------------------------------
This feature contains both a kernel module (bmc_panic) and a component
rpm (panicsel) for various utilities.

The bmc_panic kernel module adds features to the Linux Panic Handler so
that more information can be saved and passed along if a Linux panic
condition occurs.  This package enables the bmc_panic kernel module to
handle these additional features.
bmc_panic features:
 - write OS Critical Stop message to firmware System Event Log (SEL)
 - turn on the Critical Alarm LED on the Telco Alarms Panel
 - send SNMP trap via BMC LAN Alerting mechanism

The Panic Handler module (bmc_panic) inserts itself in the panic_notifier
list, then, if a panic occurs, bmc_panic is notified, and it performs
certain functions.  It writes an "OS Critical Stop" message to the firmware
System Event Log, turns on the Critical Alarm LED on the Telco Alarms
Panel, and sends a BMC LAN Alert via the firmware SNMP capability, even
after the OS is unavailable.  This module contains a portion of the valinux
IPMI driver in order to communicate with the BMC via IPMI, but none of the
IPMI interfaces used by bmc_panic are exposed so that it will not conflict
with any other IPMI driver module that may be loaded by the kernel.
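
The notifier-list mechanism described above can be modelled in a few lines.
This is a user-space toy, not the bmc_panic code: the handler names, the
priorities, and the returned strings are assumptions, and the only kernel
behavior it mirrors is that registered callbacks are invoked in priority
order.

```python
class NotifierChain:
    """Toy model of a kernel notifier chain: callbacks sorted by priority."""

    def __init__(self):
        self._blocks = []  # list of (priority, callback)

    def register(self, callback, priority=0):
        self._blocks.append((priority, callback))
        # Higher priority runs first; sort is stable for equal priorities.
        self._blocks.sort(key=lambda block: -block[0])

    def call_chain(self, event):
        return [callback(event) for _, callback in self._blocks]


# Hypothetical handlers standing in for bmc_panic and a crash-dump hook.
def bmc_panic_notify(msg):
    return f"SEL + LED + LAN alert for: {msg}"


def crash_dump_notify(msg):
    return f"dump saved for: {msg}"


panic_notifier_list = NotifierChain()
panic_notifier_list.register(crash_dump_notify, priority=0)
panic_notifier_list.register(bmc_panic_notify, priority=10)

for line in panic_notifier_list.call_chain("example panic string"):
    print(line)
```

Because each feature only registers a callback, features such as crash dump
and bmc_panic can coexist without calling each other directly.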

The panicsel utilities below allow the user to access the firmware System
Event Log and configure the Platform Event Filter table for the new OS
Critical Stop records.
showsel        - show the System Event Log records
pefconfig      - show and configure the Platform Event Filter table
                 to allow BMC LAN alerts from OS Critical Stop messages,
                 also shows and sets the BMC LAN configuration parameters
hwreset        - to cause the BMC to hard reset the system
tmconfig       - to set up the BMC Serial port for various modes, such as
                 Terminal Mode (not yet supported in this release).

DEPENDENCIES

The Panic Handler Enhancements currently work with platforms that
support the IPMI standard.  If the platform does not support IPMI, these
changes are inert, but the code could be modified for another system
management interface.  The Service Availability Forum is working on a
meta-standard API that could be used to group IPMI and other system
management interfaces under one meta-standard.  When this becomes
available, these Panic Handler Enhancements will conform to that API so
that non-IPMI platforms can be integrated more easily.

The Panic Handler enhancements depend on the CONFIG_BMCPANIC kernel
parameter being set in the kernel config file (/usr/src/linux/.config),
in order to export two key variables, and include the bmc_panic module.
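
A quick way to check this dependency is to look for the option in the
kernel config file.  The sketch below parses config text in the usual
"CONFIG_NAME=value" format; the sample content is invented, and whether
CONFIG_BMCPANIC should be "y" or "m" on a given system is an assumption
here, not something stated above.

```python
def kernel_option(config_text: str, name: str):
    """Return the value of a kernel option ('y', 'm', ...) or None if unset."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None


# Sample text only; on a real system read the file named in the message,
# e.g. open("/usr/src/linux/.config").read().
sample = """\
CONFIG_MODULES=y
# CONFIG_FOO is not set
CONFIG_BMCPANIC=y
"""
print(kernel_option(sample, "CONFIG_BMCPANIC"))
print(kernel_option(sample, "CONFIG_FOO"))
```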

The panicsel utilities require an IPMI Driver, either the Intel IPMI
package (ipmidrvr, /dev/imb) or the valinux IPMI Driver (/dev/ipmikcs).

4. Code location:
-----------------
The bmc_panic kernel patch and the panicsel utilities are located at:
   http://cvs.developer.osdl.org/viewcvs/viewcvs.cgi/components/panicsel/
or
   http://cvs.carrierlinux.org/viewcvs/viewcvs.cgi/components/panicsel/
or
   http://sourceforge.net/projects/panicsel/

_______________________________________________
cgl_discussion mailing list
cgl_discussion at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/cgl_discussion






