Re: x86/mce merge, integration hickup + crash, design thoughts

From: Russ Anderson
Date: Wed Dec 31 2008 - 13:09:35 EST

Next message: Sam Ravnborg: "Re: [PATCH] parisc: fix module loading failure of large kernel modules (take 4)"
Previous message: Jiri Pirko: "Re: [PATCH for -mm] getrusage: fill ru_maxrss value"
In reply to: Andi Kleen: "Re: x86/mce merge, integration hickup + crash, design thoughts"
Next in thread: Andi Kleen: "Re: x86/mce merge, integration hickup + crash, design thoughts"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Dec 31, 2008 at 02:32:07PM +0100, Andi Kleen wrote:
> Russ Anderson wrote:
>
> >Summary ASCII information is useful, especially if the error
> >is clearly a hardware error. Andi is right that decoding the
> >information to print the specific failing hardware (ie which
> >DIMM) may be too dificult to decode on the way down. It would
> >be great to identify the failing hardware component on the
> >way down, when possible.
>
> This will hopefully happen in the future. In fact mcelog has support
> to decode using DMI tables, but it turns out this doesn't work
> very well in practice (both because of BIOS problems and because
> the DMI standard was not really designed for this). That is why
> I wouldn't advocate right now to move this code from mcelog
> into the kernel. This might change later.

Yes. The way ia64 deals with this is the SAL (bios) does much of the
platform specific interpretation to identify the failing hardware
and passes that information up to linux. That allows the kernel
to be more platform independent while the SAL (bios) deals with the
platform specifics.

The problem of not having a spec, such as the ia64 SAL spec, is
that other interfaces end up getting overloaded, such as
the DMI tables. Should this be part of the EFI standard?
It really does need to be part of a standard specification.

> >>>It turns out that users don't really find this more enlightening (most
> >>>users have no clue what a Northbridge is). They think it's some kind of
> >>>kernel bug even with the HARDWARE ERROR header.
> >>You should not assume that administrators/users reading kernel crash
> >>messages are dumb. (an ordinary user wont see it most of the time anyway)
> >>
> >>The usage patterns i see is that admins who get an MCE crash often fail
> >>to write down the whole MCE message (not realizing that it is important)
> >>and have to go back and reproduce the MCE crash once again before they
> >>can get any meaningful information.
> >
> >This is why saving the error records to MVRAM is so useful.
> >After reboot the records can be read, formatted, and logged.
>
> I've been looking at using EFI runtime services for this, but it's

I believe that is the right approach. Error handling was one
of my primary motivations for getting in EFI runtime services
for UV. Having a more standard interface, rather than platform
specific, would be good.

> also somewhat problematic (e.g. getting the messages out again in
> a useful way without risking duplicating events)

In ia64, salinfo_decode looks for duplicate records when reading
from SAL (bios). SGI SAL uses hardware timestamps when building
error records to avoid duplicates. There are still occasional
duplicate records in a few cases, but rare enough to not be a
problem.

One case that always generates duplicate records is a recovered
MCA. The MCA record is logged to NVRAM before linux tries to
recover (in case the system crashes). After recovery, the MCA
record is logged with a recovered status, making the record look
different than the record logged to NVRAM (to salinfo_decoded).
salinfo_decoded logs both the original MCA record and the MCA
record marked recovered (writing them to /var/log/salinfo/decoded).
It turned out to be useful to have both records when explaining
to customers how error recovery worked (in the case of an
application hitting a memory uncorrectable error). When they
realize that each duplicate MCA record represents a time that the
system did not crash, they suddenly like those duplicate records. :-)

The bigger concern is not losing records (especially with a burst
of errors). The issue is keeping the older (more significant)
error records and dropping the newer (less significant), not
to be confused with thresholding correctable errors...

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@xxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Sam Ravnborg: "Re: [PATCH] parisc: fix module loading failure of large kernel modules (take 4)"
Previous message: Jiri Pirko: "Re: [PATCH for -mm] getrusage: fill ru_maxrss value"
In reply to: Andi Kleen: "Re: x86/mce merge, integration hickup + crash, design thoughts"
Next in thread: Andi Kleen: "Re: x86/mce merge, integration hickup + crash, design thoughts"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]