Re: Hardware Error Kernel Mini-Summit

From: Andi Kleen
Date: Mon Jun 14 2010 - 16:36:26 EST

On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:

Hi Eric,

> - The current EDAC code displays which DIMMS you have plugged
> in so you can tell if you unplug one, if it was the DIMM
> you were aiming at.

Binary search for bad DIMMs. The way to handle memory errors in
the 21st century.

Obviously that does not really work, especially not on large
memory systems.

> > On a lot of modern systems I checked DMI
> > seems reasonably accurate in terms of layout, so I suspect they can
> > be handled with this. For others probably
> > still need some special driver, but one
> > with a proper interface.
>
> DMI is great on the days it works, there is a lot of variations
> between BIOS's. Also if the information is decent it can be
> used to inform the current EDAC code as well as anything else.

No, the DMI layout is unfortunately difficult to map to the EDAC
layout. That's mostly EDAC's fault actually.

A sane EDAC replacement could be fed from DMI.

> You mean an interface that doesn't report the error so people
> won't complain to you about a near useless kernel error
> message.

DMI[1] does not report the errors; the errors are in machine checks
(or possibly other non-architectural registers).
DMI just gives you enumeration. It doesn't give everything,
but it's reasonably complete at least.

[1] except for the event log, but I'm not proposing to use that.
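As a rough illustration of the kind of enumeration DMI provides, here
is a minimal Python sketch that parses text in the style of
`dmidecode -t 17` (Memory Device records) into a list of slots. The
sample text is made up for illustration, the helper names
parse_dmi_memory_devices and populated_slots are hypothetical, and
real BIOSes vary in which fields they fill in.

```python
# Made-up sample in the style of `dmidecode -t 17` output (illustrative only).
SAMPLE = """\
Handle 0x0011, DMI type 17, 27 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM_A1
    Bank Locator: BANK 0

Handle 0x0012, DMI type 17, 27 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_A2
    Bank Locator: BANK 1
"""


def parse_dmi_memory_devices(text):
    """Return one dict of 'Key: Value' fields per Memory Device record."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new type 17 record.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current record.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def populated_slots(devices):
    """Return the Locator of every slot that reports an installed module."""
    return [
        d.get("Locator", "?")
        for d in devices
        if d.get("Size", "").lower() != "no module installed"
    ]


if __name__ == "__main__":
    for dev in parse_dmi_memory_devices(SAMPLE):
        print(dev.get("Locator"), dev.get("Bank Locator"), dev.get("Size"))
```

This only enumerates slots. It says nothing about errors, which would
still have to come from machine checks and be mapped onto these
locators by something else.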

> Setting the scrub rate isn't half so interesting as displaying
> it.

I would still like to understand the idea behind varying this
at all. If you have any deeper thoughts on this, please send them.

> Having basic hardware information displayed in sysfs seems to be the
> design of the rest of linux. I don't see abandoning that part of the
> EDAC design as wise.
> Displaying the fact that ECC is turned on in the hardware is one
> of the more interesting bits. That at least allows you to verify
> that things are working.

There are hundreds to thousands of BIOS level hardware knobs for memory
configuration (and if you count all BIOS knobs for everything, far more).

Why do you want to check only a single bit? (It is actually not
a single bit either; there are a lot of different ways to set this.)

I can see there's a need to check that the BIOS is doing the right
thing, but you'll never get that from a few sysfs fields.
You need a proper tool that is written for the system in question.

ak@xxxxxxxxxxxxxxx -- Speaking for myself only.