Re: [RFC PATCH 00/21 v2] amd64_edac: EDAC module for AMD64

From: Doug Thompson
Date: Thu Apr 30 2009 - 10:39:48 EST



--- On Thu, 4/30/09, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:

> From: Andi Kleen <andi@xxxxxxxxxxxxxx>
> Subject: Re: [RFC PATCH 00/21 v2] amd64_edac: EDAC module for AMD64
> To: "Doug Thompson" <norsk5@xxxxxxxxx>
> Cc: "Andi Kleen" <andi@xxxxxxxxxxxxxx>
> Date: Thursday, April 30, 2009, 1:05 AM
> > The problem we have had is once
> > an Uncorrected Error fires and dumps the address, mapping it
> > to the DIMM silk screen label is difficult, especially in
> > user space, in gaining access to the registers of the
> > controller. 
>
> You can just do it either after reboot or in the crash
> kernel. I don't
> think it's required to put it all in kernel. Also you don't
> really
> need access to the registers;

Actually, according to AMD, their reference code for mapping an error address to a memory slot does require access to the controller's registers. Starting on page 67 of the BKDG for family F10, available from their website, are two and a half pages of code that perform that mapping. It takes interleaving of all kinds into consideration, among other things. It is gnarly, to say the least.

> SMBIOS provides this
> information and
> mcelog knows how to convert it.

As I understand SMBIOS, it provides a linear assignment of basic memory starts and lengths, but it does not provide the memory controller context that AMD's reference code takes into consideration.

>
> Trying to add other consumers to mce.c will be likely very
> messy;
> there's really no generic way to do it. I hope you're not
> planning
> turning the nicely CPU independent code in mce.c into a
> mess
> of twisty CPU specific passages like the old 32bit code
> was.
>
> -Andi

No, not at all. Keeping the code "clean" is paramount, but we are looking for an interface that accepts the MCE error register structure and maps that information to at least a DIMM label field, if not more.

The EDAC module would register for that interface upon loading and unregister upon module unload.

The MCE code would call a stub routine that either returns "no mapping occurred" or calls the EDAC mapper. From that return code, MCE could determine whether a mapping occurred. If it did, MCE would display the desired information; otherwise it would proceed as normal.
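
A minimal userspace sketch of that flow, for illustration only. Every name in it (struct mce_info, mce_register_dimm_mapper, mce_unregister_dimm_mapper, mce_map_to_dimm, example_edac_mapper) is hypothetical and is not an existing mce.c or EDAC symbol, and the address-to-label rule in the example mapper is made up; a real mapper would follow the BKDG reference code. Locking against concurrent module unload is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the MCE error register structure. */
struct mce_info {
    uint64_t status;
    uint64_t addr;
};

/* Mapper callback: returns 0 and fills 'label' on success, nonzero otherwise. */
typedef int (*dimm_mapper_fn)(const struct mce_info *m, char *label, size_t len);

/* NULL while no EDAC module has registered a mapper. */
static dimm_mapper_fn registered_mapper;

/* Called by the EDAC module at load time. Only one mapper at a time. */
int mce_register_dimm_mapper(dimm_mapper_fn fn)
{
    if (fn == NULL || registered_mapper != NULL)
        return -1;
    registered_mapper = fn;
    return 0;
}

/* Called by the EDAC module at unload time. */
void mce_unregister_dimm_mapper(dimm_mapper_fn fn)
{
    if (registered_mapper == fn)
        registered_mapper = NULL;
}

/* Stub called from the MCE path: nonzero means "no mapping occurred". */
int mce_map_to_dimm(const struct mce_info *m, char *label, size_t len)
{
    if (registered_mapper == NULL)
        return -1;
    return registered_mapper(m, label, len);
}

/* Made-up mapper: picks a label from one address bit. Illustration only. */
int example_edac_mapper(const struct mce_info *m, char *label, size_t len)
{
    char bank = (((m->addr >> 30) & 1) != 0) ? 'B' : 'A';

    snprintf(label, len, "DIMM_%c1", bank);
    return 0;
}

/* MCE side: append the label if a mapping occurred, else report as before. */
void mce_report(const struct mce_info *m, char *out, size_t len)
{
    char label[32];

    assert(m != NULL && out != NULL);
    if (mce_map_to_dimm(m, label, sizeof(label)) == 0)
        snprintf(out, len, "MCE addr 0x%llx on %s",
                 (unsigned long long)m->addr, label);
    else
        snprintf(out, len, "MCE addr 0x%llx",
                 (unsigned long long)m->addr);
}
```

The point of the stub is that the MCE code never references EDAC directly: with no module loaded, mce_map_to_dimm() just reports that no mapping occurred and the CPU-independent path is unchanged.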

doug t
