Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Borislav Petkov
Date: Fri Jun 01 2012 - 12:00:20 EST


On Fri, Jun 01, 2012 at 03:42:54PM +0000, Luck, Tony wrote:
> > Yeah, about that. What are you guys doing about losing CECCs when
> > throttling is on, I'm assuming there's no way around it?
>
> Yes, when throttling is on, we will lose errors, but I don't think
> this is too big of a deal - see below.
>
> >> Will EDAC drivers loop over some chipset registers blasting
> >> out huge numbers of trace records ... that seems just as bad
> >> for system throughput as a CMCI storm. And just as useless.
> >
> > Why useless?
>
> "Useless" was hyperbole - but "overkill" will convey my meaning better.
>
> Consider the case when we are seeing a storm of errors reported. How
> many such error reports do you need to adequately diagnose the problem?
>
> If you have a stuck bit in a hot memory location, all the reports will
> be at the same address. After 10 repeats you'll be pretty sure that
> you have just one problem address. After 100 identical reports you
> should be convinced ... no need to log another million.

Yeah, we want to have sensible thresholds for this, after which the
(n+1)-st error reported at the same address offlines the page.

> If there is a path failure that results in a whole range of addresses
> reporting bad, then 10 may not be enough to identify the pattern,
> but 100 should get you close, and 1000 ought to be close enough to
> certainty that dropping records 1001 ... 1000000 won't adversely
> affect your diagnosis.

Right, so I've been thinking about collecting error addresses in
userspace (ras daemon or whatever) with a leaky bucket counter which,
when reaching a previously programmed threshold, offlines the page.

This should hopefully mitigate the error burst faster and bring back
CMCI from polling to normal interrupts.
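
To make the leaky bucket idea concrete, here is a minimal userspace
sketch. Everything in it is an assumption for illustration and not
code from any existing daemon: the names (struct tracker,
tracker_record), the 4 KiB page size, the fixed table size and the
one-unit-per-period leak rate are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SHIFT_ASSUMED 12	/* assumption: 4 KiB pages */
#define NBUCKETS 1024		/* hypothetical table size */

struct bucket {
	uint64_t pfn;		/* page frame number being tracked */
	uint64_t last;		/* time of last leak accounting, in seconds */
	uint32_t level;		/* current fill level of the bucket */
	int used;		/* slot occupied? */
};

struct tracker {
	struct bucket tab[NBUCKETS];
	uint32_t threshold;	/* fill level at which we want to offline */
	uint64_t leak_period;	/* seconds it takes to leak one unit */
};

static void tracker_init(struct tracker *t, uint32_t threshold,
			 uint64_t leak_period)
{
	assert(t != NULL);
	memset(t, 0, sizeof(*t));
	t->threshold = threshold;
	t->leak_period = leak_period ? leak_period : 1;
}

/*
 * Account one corrected error at physical address paddr seen at time
 * now (seconds, assumed monotonic). Returns 1 if the page has reached
 * the threshold, 0 if not, -1 if the table is full.
 */
static int tracker_record(struct tracker *t, uint64_t paddr, uint64_t now)
{
	uint64_t pfn = paddr >> PAGE_SHIFT_ASSUMED;
	size_t start = (size_t)(pfn % NBUCKETS);

	/* Linear probing; entries are never deleted in this sketch. */
	for (size_t i = 0; i < NBUCKETS; i++) {
		struct bucket *b = &t->tab[(start + i) % NBUCKETS];

		if (b->used && b->pfn != pfn)
			continue;
		if (!b->used) {
			b->used = 1;
			b->pfn = pfn;
			b->level = 0;
			b->last = now;
		}

		/* Leak one unit per elapsed leak_period. */
		if (now > b->last) {
			uint64_t leaked = (now - b->last) / t->leak_period;

			if (leaked) {
				b->level = leaked >= b->level ?
					   0 : b->level - (uint32_t)leaked;
				b->last += leaked * t->leak_period;
			}
		}

		b->level++;
		return b->level >= t->threshold;
	}
	return -1;
}
```

The caller would feed every reported error address into
tracker_record() and request offlining when it returns 1. Old,
isolated errors leak away over time, so only a page that keeps
erroring faster than the leak rate ever reaches the threshold.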

> [Gong: after thinking about this to write the above - I think that the
> CMCI storm detector should trigger at a higher number than "5" that we
> picked. That works well for the single stuck bit, but perhaps doesn't
> give us enough samples for the case where the error affects a range
> of addresses. We should consider going to 50, or perhaps even 500 ...
> but we'll need some measurements to determine the impact on the system
> from taking that many CMCI interrupts and logging the larger number of
> errors.]

And I'm thinking that with proper, proactive page offlining triggered
from userspace, you would probably need the throttling in the kernel
only on very rare, bursty occasions ...

> The problem case is if you are unlucky enough to have two different
> failures at the same time. One with storm like properties, the other
> with some very modest rate of reporting. This is where early filtering
> might hurt you ... diagnosis might miss the trickle of errors hidden by
> the noise of the storm. So in this case we might throttle the errors,
> deal with the source of the storm, and then die because we missed the
> early warning signs from the trickle. But this scenario requires a lot
> of rare things to happen all at the same time:
> - Two unrelated errors, with specific characteristics
> - The quieter error to be completely swamped by the storm
> - The quieter error to escalate to fatal in a really short period (before
> we can turn off filtering after silencing the source of the storm).

Yeah, that's nasty. I don't think you can catch a case like that where
an error turns into UC under the threshold value...

If you consume it, you kill the process; if it is in kernel space, you
really have to pack your bags and hang on to your hat.

> I think this is at least as good as trying to capture every error.
> Doing this means that we are so swamped by the logging that we also
> might not get around to solving the storm problem before our quiet
> killer escalates.

Yessir.

> Do you have other scenarios where you think we can do better if we
> log tens of thousands or hundreds of thousands of errors in order to
> diagnose the source(s) of the problem(s)?

My only example is by counting the errors in userspace and using a leaky
bucket algo to decide when to act by offlining pages or disabling hw
components.

This is why I'm advocating userspace - you can implement almost
anything there. We only need the kernel to be as thin and as fast as
possible when reporting those errors, so that we get the most reliable
and complete info. The kernel's job is only to report as many errors
as it possibly can so that userspace can create a good picture of the
situation.

Then, it should act swiftly when disabling those pages so that the
kernel can get back to normal operation as fast as possible.
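
For the offlining step itself, one possible route from userspace is
the soft offline sysfs file, sketched below. This assumes a kernel
built with CONFIG_MEMORY_FAILURE that exposes
/sys/devices/system/memory/soft_offline_page; the helper name
request_soft_offline() is hypothetical, and the path is passed in as
a parameter so the function can be exercised against an ordinary
file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed location of the soft offline interface. */
#define SOFT_OFFLINE_PATH "/sys/devices/system/memory/soft_offline_page"

/*
 * Ask the kernel to soft-offline the page containing physical address
 * paddr by writing the address, in hex, to the file at path.
 * Returns 0 on success, -1 on failure (typically needs root).
 */
static int request_soft_offline(const char *path, uint64_t paddr)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "0x%" PRIx64 "\n", paddr) > 0;

	/* The write may only be flushed, and rejected, at close time. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A daemon would call request_soft_offline(SOFT_OFFLINE_PATH, addr)
once its counter for that page crosses the threshold.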

If we decide - for whatever reason - that we need a different policy, we
can always hack it up quickly in the ras daemon.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551