[cgl_discussion] Project Review: MCA Handler

Pallipadi, Venkatesh venkatesh.pallipadi at intel.com
Tue Oct 29 10:56:21 PST 2002

Hi Steve,

An IA32 MCA error is _typically_ caused by a critical hardware error (bus,
cache, etc.) and is not recoverable. It is signaled to the OS through an
exception and is normally followed by a panic/reboot of the system. The best
we can do at that point is to log all the errors through printk and the event
log, to help the administrator identify the problem and prevent it in the
future. Exposing MCA errors through an interface like /proc would not be
useful, because a typical MCA error is followed by a system reboot.

There is a subclass of MCA errors introduced in the P4: corrected MCA errors.
These are not signaled through an exception. The OS can find them by
periodically polling the MCA-related MSRs. /proc may be useful for this
subset of MCA errors. But I think that even here, logging the error and
letting the event log/notification mechanism handle it is a cleaner interface
than asking a user or daemon to check /proc periodically.
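
To make the polling case concrete, here is a minimal sketch of how a poller
might pick corrected errors out of the per-bank status values. It is an
illustration only, not the code in the patch. The bit positions are those
documented for IA32_MCi_STATUS in the Intel SDM; the function name and the
idea of passing the raw values in as a list are assumptions made for the
example, and the actual MSR reads are not shown.

```python
# Bit positions in IA32_MCi_STATUS, as documented in the Intel SDM.
MCI_STATUS_VAL = 1 << 63    # register holds a valid error record
MCI_STATUS_OVER = 1 << 62   # an earlier error was overwritten
MCI_STATUS_UC = 1 << 61     # error was NOT corrected by hardware
MCI_STATUS_MISCV = 1 << 59  # IA32_MCi_MISC holds valid data
MCI_STATUS_ADDRV = 1 << 58  # IA32_MCi_ADDR holds valid data


def find_corrected_errors(bank_status):
    """Return (bank, status) pairs for valid, hardware-corrected errors.

    bank_status is a hypothetical list of raw 64-bit status values, one
    per MCA bank, in the order a polling routine read them.
    """
    corrected = []
    for bank, status in enumerate(bank_status):
        is_valid = bool(status & MCI_STATUS_VAL)
        is_uncorrected = bool(status & MCI_STATUS_UC)
        if is_valid and not is_uncorrected:
            corrected.append((bank, status))
    return corrected


if __name__ == "__main__":
    # Made-up sample values: bank 1 holds a corrected error with a valid
    # address, bank 2 holds an uncorrected one, banks 0 and 3 are empty.
    sample = [
        0,
        MCI_STATUS_VAL | MCI_STATUS_ADDRV | 0x0135,
        MCI_STATUS_VAL | MCI_STATUS_UC | 0x0800,
        0,
    ]
    for bank, status in find_corrected_errors(sample):
        print("bank %d: corrected error, status=0x%016x" % (bank, status))
```

A poller along these lines would hand each record to the event log rather
than hold it for a reader of /proc, which is the interface argued for above.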


-----Original Message-----
From: Steven Dake [mailto:sdake at mvista.com]
Sent: Monday, October 28, 2002 10:09 AM
To: Pallipadi, Venkatesh
Cc: 'Randy.Dunlap'; 'cgl_discussion at lists.osdl.org'
Subject: Re: [cgl_discussion] Project Review: MCA Handler


Have you considered exporting the information through some typical
ramfs-style interface, such as /proc?


Pallipadi, Venkatesh wrote:

>Hi Randy,
>The current MCA handler does decode the STATUS information during an MCA
>error and gives some generic error details (like TLB error, cache error, etc.)
>to the user. But we feel that providing the complete error information
>(decoding of MISC, ADDR and looking at whether this happened in one
>particular CPU or all the CPUs, etc) can be done better by a user tool or
>event log manager rather than in the kernel. We are logging the contents of
>all the MCA related registers (including MISC register) at the time of an
>MCA error, which can then be used by the user level MCA error information
>
>-----Original Message-----
>From: Randy.Dunlap [mailto:rddunlap at osdl.org]
>Sent: Tuesday, October 22, 2002 5:16 PM
>To: Pallipadi, Venkatesh
>Cc: 'cgl_discussion at lists.osdl.org'
>Subject: Re: [cgl_discussion] Project Review: MCA Handler
>
>On Thu, 3 Oct 2002, Pallipadi, Venkatesh wrote:
>
>| Requirements related to MCA Handler project
>| -------------------------------------------
>| Requirement: 4.5 Platform Signal Handler
>| How MCA Handler meets the CGL requirements
>| ------------------------------------------
>| This patch adds the MCA error info. onto the event log, in the format
>| specified by PSH - Event log interface spec.
>| Project design information
>| --------------------------
>| This project adds a kernel patch to:
>| 1) Log the MCA errors onto event log as per the format defined in
>| Log spec.
>| 2) Add support for logging the additional information available on
>| an MCA in a P4-based system.
>| Code location
>| -------------
>| The kernel patch for MCA Handler is located in the cgl development tree
>| under kernel/linux-2.4.18/patches/mca_log
>
>Hi Venkatesh,
>For Pentium 4, I think that you could add some real value to MCE reporting
>by decoding the MISC register bits to provide some useful information
>to users or whoever is trying to support a system after an MCE
>event occurs.  How about it?
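
As a rough illustration of the generic decoding mentioned in this thread (TLB
error, cache error, and so on), the sketch below classifies the MCA error code
held in bits 15:0 of a logged STATUS value, using the compound error code
layouts from the Intel SDM. It is the kind of thing a user-level tool could do
with the logged register contents; it is not the handler's code. The function
name and the output strings are made up for the example, and the
model-specific MISC register is deliberately not decoded here.

```python
# Sub-fields shared by the TLB and cache compound error codes.
LEVELS = ["L0", "L1", "L2", "generic"]                     # LL, bits 1:0
TYPES = ["instruction", "data", "generic", "unknown"]      # TT, bits 3:2


def classify_mca_error_code(status):
    """Map the MCA error code in bits 15:0 of STATUS to a short label."""
    code = status & 0xFFFF
    if code == 0:
        return "no error"
    if (code & 0xF800) == 0x0800:      # 0000 1PPT RRRR IILL
        return "bus/interconnect error"
    detail = "(%s, %s)" % (TYPES[(code >> 2) & 0x3], LEVELS[code & 0x3])
    if (code & 0xFF00) == 0x0100:      # 0000 0001 RRRR TTLL
        return "cache error " + detail
    if (code & 0xFFF0) == 0x0010:      # 0000 0000 0001 TTLL
        return "TLB error " + detail
    return "other error code 0x%04x" % code


if __name__ == "__main__":
    # Made-up sample codes, chosen only to exercise each branch.
    for sample in (0x0135, 0x0011, 0x0800, 0x0001):
        print("0x%04x: %s" % (sample, classify_mca_error_code(sample)))
```

Anything beyond this generic classification, such as the MISC and ADDR
contents or whether the error hit one CPU or all of them, needs
processor-specific knowledge, which is the part proposed above for a user
tool or event log manager.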
