Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Borislav Petkov
Date: Tue May 29 2012 - 07:58:19 EST


On Thu, May 24, 2012 at 03:00:53PM -0300, Mauro Carvalho Chehab wrote:
> In the current drivers, the grain is static. I'm not sure whether the grain is really
> a per-memory-controller property or whether this is yet another issue with the way
> the EDAC core handles such information.
>
> I suspect that, on sophisticated memory controllers that can do any type of
> DIMM interleaving, including no interleave at all, the grain can vary from
> one memory address range to the other.

Ah, you suspect. Well, since you suspect, then it has to be true.

The granularity of a reported error has nothing directly to do with
memory interleaving.

> If we change the API to have an explicit sysfs node to express the grain,
> and later end up needing a per-address-range grain, we'll need to break
> the kABI.
>
> So, keeping the grain information at the tracepoint is more flexible, as it
> can cover both situations.

And adding useless fields is bloating it.

> >>> But the more important question is: does the grain help us when handling
> >>> the error info in userspace?
> >>>
> >>> It tells us that at this physical address with "grain" granularity we
> >>> had an error. So?
> >>
> >> While a certain number of corrected errors that happened on different, sparse,
> >> addresses may not mean damaged memory, the same number of corrected errors
> >> happening at the same physical address/grain means that the DRAM chip that
> >> contains such address is damaged, so the corresponding DIMM needs to be
> >> replaced.
> >>
> >> So, the address/grain can be used by userspace algorithms to increase the
> >> probability that a DIMM is damaged.
> >
> > I have no idea what you're saying here.
> >
> > The DIMM can be pinpointed using the address only, why do you need the
> > grain too?
>
> You can pinpoint a DIMM but in order to pinpoint the affected MOSFET transistors,

The MOSFET transistors, every single one of them??! Wohahahah, this just
made my day!

> the address and address mask are needed, as most memory controllers can't point
> to a single address, because the register that stores the address doesn't have
> enough bits to store the full content of the instruction pointer register, or because
> of some other internal device issues.
>
> So, two different "addresses" could actually point to the same group of transistors
> inside a DIMM.
>
> Also, higher grain values may affect the error statistics. For example, the i3200_edac
> driver has a grain that can be 64 MB, while other devices have a grain of 1.

I think you mean

#define I3200_TOM_SHIFT 26 /* 64MiB grain */

which is the Top-Of-Memory shift value. How that is a grain in the sense of error
granularity, I can't fathom.
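
For reference, the "64MiB" in that comment is just what a shift of 26 bits
works out to. A minimal sketch of the arithmetic follows; shift_to_bytes() is
a hypothetical helper written for illustration, not something taken from
i3200_edac or the EDAC core:

```c
#include <stdint.h>

/* Value quoted from i3200_edac in the message above. */
#define I3200_TOM_SHIFT 26

/*
 * Hypothetical helper: number of bytes covered by one unit of a
 * register field that is shifted left by `shift` bits.
 */
static inline uint64_t shift_to_bytes(unsigned int shift)
{
	return UINT64_C(1) << shift;
}
```

shift_to_bytes(I3200_TOM_SHIFT) is 67108864 bytes, i.e. 64 * 1024 * 1024.
That describes the resolution of the Top-Of-Memory value and says nothing by
itself about the resolution of a reported error address.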

Oh, and by the way, this define is unused and can be removed.

So, to sum up, I'm still completely unconvinced that 'grain' is needed, so remove it.
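
For anyone following along, the address/grain argument quoted above amounts
to treating the grain as an address mask, so that two reported addresses
inside the same grain-sized block compare equal. A rough sketch of that idea
follows, assuming the grain is a power-of-two byte count; grain_base() and
same_grain() are hypothetical names, not EDAC or tracepoint API:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: round addr down to the start of its grain-sized
 * block. Assumes grain is a non-zero power of two, in bytes.
 */
static inline uint64_t grain_base(uint64_t addr, uint64_t grain)
{
	return addr & ~(grain - 1);
}

/* True if both addresses fall into the same grain-sized block. */
static inline bool same_grain(uint64_t a, uint64_t b, uint64_t grain)
{
	return grain_base(a, grain) == grain_base(b, grain);
}
```

With a grain of 64, addresses 0x1000 and 0x103f land in the same block while
0x1040 does not; with a grain of 1, only identical addresses match.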

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551