Re: Hardware Error Kernel Mini-Summit

From: Borislav Petkov
Date: Tue Jun 15 2010 - 06:00:35 EST


From: Nils Carlson <nils.carlson@xxxxxxxxxxx>
Date: Tue, Jun 15, 2010 at 04:06:33AM -0400

> On Tue, 15 Jun 2010, Andi Kleen wrote:
>
> > On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote:
> >
> > Hi Doug,
> >
> > >
> > > Maybe I didn't see it covered (or I missed it), but EDAC is used on
> > > more than just x86 based machines, though they are the majority by
> > > volume. We should have an abstraction that covers all the archs, like
> > > we do with other subsystems of Linux.
> >
> > The way I envision it working is that an abstracted DIMM interface
> > (or edac2 or whatever you want to call it) can be fed from any reasonable
> > DIMM layout driver. This could be either DMI on x86 or some other
> > driver. There would be nothing really x86 specific about that.
>
> Could you maybe provide some references on how DIMM layout
> could be read from DMI? I can't find anything nearly this specific,
> or is it something we're expecting to happen in future BIOS's?
>
> Also, there would probably need to be some standard describing
> different DIMM layouts in general, though maybe such a thing exists.
>
> In other words, there would have to be some way of ascertaining
> that the info you read from DMI is sufficient to decode MCEs so that
> a faulting DIMM can be identified. In an ideal world, this could
> be tested by some simple tool that could be run by the BIOS writers
> to test that they're providing the OS with sufficient info.

You cannot decode an ECC error to a DIMM using DMI info alone - at least
on AMD you cannot. The MCE contains the physical address where the ECC
error happened, and you need EDAC to convert this to a chip select row.
Additionally, depending on the addressing mode the DRAM controller uses,
you need the error syndrome.
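
To illustrate the address-to-chip-select step, here is a minimal sketch.
It is not the EDAC driver code: it assumes a simplified model in which
each chip select has a base address and a set of "don't care" address
bits, and it ignores node interleaving, memory hoisting and the error
syndrome. The names ChipSelect, addr_to_csrow and TABLE, and the two
1 GiB chip selects, are made up for the example.

```python
from typing import NamedTuple, Optional, Sequence


class ChipSelect(NamedTuple):
    base: int    # base address programmed for this chip select (hypothetical)
    ignore: int  # address bits that do not take part in the comparison


def addr_to_csrow(addr: int, table: Sequence[ChipSelect]) -> Optional[int]:
    """Return the index of the chip select row that claims addr, or None."""
    for row, cs in enumerate(table):
        care = ~cs.ignore  # only these bits are compared against the base
        if (addr & care) == (cs.base & care):
            return row
    return None


# Made-up layout: two 1 GiB chip selects placed back to back.
GIB = 1 << 30
TABLE = [
    ChipSelect(base=0, ignore=GIB - 1),
    ChipSelect(base=GIB, ignore=GIB - 1),
]
```

With this table, an error address inside the second gigabyte resolves to
row 1, and an address above 2 GiB matches no row at all.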

Now, once you have the chip select row, you need to map it to a DIMM
rank. To do that, you need the DIMM info stored in the SPD ROM (one of
the SPD fields is the DIMM rank, which is needed to unambiguously
pinpoint which DIMM is generating those errors). Then you can use the
DMI info - assuming it contains the correct silk screen labels on the
motherboard - to map to a DIMM.
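
A rough sketch of why the rank count matters for that mapping. It rests
on an assumption that will not hold on every board: chip select rows are
handed out to the DIMM slots consecutively, one row per rank. The helper
names and the slot labels are hypothetical; real labels would come from
DMI, and real rank counts from the SPD ROM.

```python
from typing import List, Tuple


def csrow_to_dimm(csrow: int, ranks_per_dimm: List[int]) -> Tuple[int, int]:
    """Map a chip select row to (DIMM index, rank within that DIMM).

    Assumes rows are assigned to DIMMs in slot order, one row per rank.
    """
    first = 0  # first chip select row belonging to the current DIMM
    for dimm, ranks in enumerate(ranks_per_dimm):
        if csrow < first + ranks:
            return dimm, csrow - first
        first += ranks
    raise ValueError(f"chip select row {csrow} is not populated")


def csrow_to_label(csrow: int, ranks_per_dimm: List[int],
                   labels: List[str]) -> str:
    """Return the silk screen label of the DIMM that owns csrow."""
    dimm, _rank = csrow_to_dimm(csrow, ranks_per_dimm)
    return labels[dimm]


# Hypothetical silk screen labels, standing in for what DMI would report.
LABELS = ["DIMM_A1", "DIMM_A2"]
```

The same chip select row 1 lands on DIMM_A1 when both modules are
dual-rank, but on DIMM_A2 when both are single-rank, which is the
ambiguity the SPD rank information resolves.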

What EDAC currently does is decode the ECC error to a chip select. What
we still need is some I2C/SMBus code which can read the SPD ROM. I
haven't had the time to look into it yet, though.
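
As a userspace illustration of what that SPD code would have to extract,
the sketch below pulls the rank count out of an SPD dump that has already
been read into memory. The byte offsets are taken from the JEDEC SPD
layouts for DDR2 and DDR3 as commonly documented and should be checked
against the spec before being relied on; spd_rank_count is a made-up
name, and nothing here touches the I2C/SMBus bus itself.

```python
DDR2 = 0x08  # SPD byte 2 value for DDR2 SDRAM
DDR3 = 0x0B  # SPD byte 2 value for DDR3 SDRAM


def spd_rank_count(spd: bytes) -> int:
    """Return the number of ranks described by a raw SPD dump."""
    mem_type = spd[2]
    if mem_type == DDR2:
        # DDR2: byte 5, bits 2:0 hold (number of ranks - 1).
        return (spd[5] & 0x07) + 1
    if mem_type == DDR3:
        # DDR3: byte 7, bits 5:3 hold (number of ranks - 1).
        return ((spd[7] >> 3) & 0x07) + 1
    raise ValueError(f"unhandled SPD memory type {mem_type:#04x}")
```

The SPD EEPROMs normally sit at I2C addresses 0x50 to 0x57. Where a
suitable EEPROM driver is bound, the raw bytes may be readable from a
sysfs file such as /sys/bus/i2c/devices/0-0050/eeprom; that path is an
assumption about the particular system, not something to count on.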

--
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.