Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Mauro Carvalho Chehab
Date: Thu May 31 2012 - 14:24:34 EST


On 31-05-2012 14:20, Borislav Petkov wrote:
> On Thu, May 31, 2012 at 04:51:27PM +0000, Luck, Tony wrote:
>> No, it's a 6-bit field used as a shift ... so if it has value "6", it
>> means cache line granularity. Value "12" would mean 4K granularity.
>> Architecturally it could say "30" to mean gigabyte, or even "63" to
>> mean "everything is gone".
>
> Right, 0x3f are 6 bits, correct, doh!
>
>>>> while a few (IIRC patrol scrub) will report with page (4K)
>>>> granularity. Linux doesn't really care - they all have to get rounded
>>>> up to page size because we can't take away just one cache line from a
>>>> process.
>>>
>>> I'd like to see that :-)
>>
>> Patrol scrub works inside the depths of the memory controller on rank/row
>> addresses, not on system physical addresses. When it finds a problem, a
>> reverse translation is needed to be able to report a system physical
>> address in MCi_ADDR. Getting all the bits right is apparently a hard thing
>> to do, so the MCI_MISC_ADDR_LSB bits are used to indicate that some low
>> order bits are not valid.
>
> Ok, thus the dynamic granularity. But we're going to end up reporting
> rank and row too so that it can be matched to the DIMM. I consider
> physical address a bonus in such cases and it is only of importance to
> those who like to replace single DRAM chips or single MOSFET transistors
> :-) :-) :-).
>

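To make the granularity discussion above concrete, here is a minimal
userspace sketch of how the 6-bit shift could be decoded. The 0x3f mask
is the one discussed above; the helper names (grain_bytes, valid_addr,
page_rounded_shift) and PAGE_SHIFT_4K are hypothetical and only meant
as an illustration, not as what the kernel actually does.

```c
#include <stdint.h>

/* Low 6 bits of MCi_MISC: number of low-order address bits that are not valid. */
#define MCI_MISC_ADDR_LSB(m) ((m) & 0x3f)

/* Assumption for this sketch: 4K pages. */
#define PAGE_SHIFT_4K 12

/* Size in bytes of the region the error may lie in (6 -> 64, 12 -> 4096). */
uint64_t grain_bytes(uint64_t misc)
{
	return 1ULL << MCI_MISC_ADDR_LSB(misc);
}

/* Reported address with the invalid low-order bits cleared. */
uint64_t valid_addr(uint64_t addr, uint64_t misc)
{
	unsigned int lsb = MCI_MISC_ADDR_LSB(misc);

	return addr & ~((1ULL << lsb) - 1);
}

/* Shift to use once the grain is rounded up to at least one page. */
unsigned int page_rounded_shift(uint64_t misc)
{
	unsigned int lsb = MCI_MISC_ADDR_LSB(misc);

	return lsb < PAGE_SHIFT_4K ? PAGE_SHIFT_4K : lsb;
}
```

With a shift of 6, grain_bytes() gives 64 (one cache line) and
page_rounded_shift() still gives 12, since a single cache line can't be
taken away from a process.
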
A single corrected error doesn't mean you need to replace anything. The need
for a replacement comes from the joint probability of several independent
kinds of event:
- random noise;
- a failure of a MOSFET transistor;
- a failure at the DIMM contacts.

In order to distinguish between them, you need to know the statistics of
each of the above stochastic processes and use some correlation functions
to detect which group of events a series of errors belongs to.

For example, the error address for a failure at the DIMM contacts can be
given by a constant random variable, affecting a group of bits in the
syndrome, while a failure in a group of MOSFET transistors will be given
by a series of degenerate distribution functions.

By properly exporting the address/grain/syndrome, a userspace program
can tell random noise apart from a defect in a DRAM chip or a bad contact
at the DIMM, and use a different error count limit for each type of
error when reporting whether a memory module should be replaced.
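
As a rough illustration of the kind of userspace filtering I mean, here
is a minimal sketch. Everything in it is an assumption made for the
example: the struct layout, the names, the REPEAT_LIMIT threshold and
the heuristic itself (repeated hits on the same grain-aligned address
are treated as a cell defect, the same syndrome repeating across
different addresses as a contact problem). A real tool would need
proper statistics for each process.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one corrected error as seen by userspace. */
struct mem_err {
	uint64_t addr;            /* reported physical address */
	unsigned int grain_bits;  /* number of invalid low-order address bits */
	uint32_t syndrome;        /* ECC syndrome */
};

enum err_class { ERR_NOISE, ERR_CELL_DEFECT, ERR_BAD_CONTACT };

/* Illustrative threshold only; not derived from any hardware data. */
#define REPEAT_LIMIT 3

/* Address with the invalid low-order bits cleared. */
uint64_t grain_base(const struct mem_err *e)
{
	unsigned int bits = e->grain_bits & 0x3f; /* keep the shift below 64 */

	return e->addr & ~((1ULL << bits) - 1);
}

/*
 * Classify a series of errors with a simple O(n^2) scan:
 *  - REPEAT_LIMIT or more errors in the same grain -> cell defect
 *  - otherwise, REPEAT_LIMIT or more errors with the
 *    same syndrome (at differing addresses)        -> bad contact
 *  - otherwise                                     -> random noise
 */
enum err_class classify(const struct mem_err *ev, size_t n)
{
	size_t max_addr = 0, max_synd = 0;

	for (size_t i = 0; i < n; i++) {
		size_t same_addr = 0, same_synd = 0;

		for (size_t j = 0; j < n; j++) {
			if (grain_base(&ev[i]) == grain_base(&ev[j]))
				same_addr++;
			if (ev[i].syndrome == ev[j].syndrome)
				same_synd++;
		}
		if (same_addr > max_addr)
			max_addr = same_addr;
		if (same_synd > max_synd)
			max_synd = same_synd;
	}

	if (max_addr >= REPEAT_LIMIT)
		return ERR_CELL_DEFECT;
	if (max_synd >= REPEAT_LIMIT)
		return ERR_BAD_CONTACT;
	return ERR_NOISE;
}
```

The point is only that, once address, grain and syndrome are all
exported, each class can get its own counting rule and limit instead of
a single global error counter.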

Regards,
Mauro
