RE: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Luck, Tony
Date: Fri Jun 01 2012 - 11:42:54 EST


> Yeah, about that. What are you guys doing about losing CECCs when
> throttling is on, I'm assuming there's no way around it?

Yes, when throttling is on, we will lose errors, but I don't think
this is too big of a deal - see below.

>> Will EDAC drivers loop over some chipset registers blasting
>> out huge numbers of trace records ... that seems just as bad
>> for system throughput as a CMCI storm. And just as useless.
>
> Why useless?

"Useless" was hyperbole - but "overkill" will convey my meaning better.

Consider the case when we are seeing a storm of errors reported. How
many such error reports do you need to adequately diagnose the problem?

If you have a stuck bit in a hot memory location, all the reports will
be at the same address. After 10 repeats you'll be pretty sure that
you have just one problem address. After 100 identical reports you
should be convinced ... no need to log another million.

If there is a path failure that results in a whole range of addresses
reporting bad, then 10 may not be enough to identify the pattern, but
100 should get you close, and 1000 ought to be close enough to certainty
that dropping records 1001 ... 1000000 won't adversely affect your
diagnosis.
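To make the sampling argument concrete, here is a small userspace sketch (not
EDAC code - the names and thresholds are made up for illustration) of how a
bounded sample of error addresses is enough to tell the two cases apart: a
stuck bit keeps repeating one address, while a path failure spreads reports
over many addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 100

/* Count distinct addresses in a bounded diagnostic sample.
 * O(n^2) is fine for n <= 100. */
static size_t count_distinct(const uint64_t *addrs, size_t n)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (addrs[j] == addrs[i])
				break;
		if (j == i)
			distinct++;
	}
	return distinct;
}

/* Hypothetical classifier: one distinct address over ~100 samples
 * suggests a stuck bit; mostly-distinct addresses suggest a path
 * failure hitting a whole range. */
static const char *classify(const uint64_t *addrs, size_t n)
{
	size_t d = count_distinct(addrs, n);

	if (d == 1)
		return "stuck-bit";
	if (d > n / 2)
		return "range-failure";
	return "mixed";
}
```

The point being: either answer stabilizes well before the millionth record,
so the records dropped by throttling carry no extra diagnostic value.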

[Gong: after thinking this through while writing the above - I think the
CMCI storm detector should trigger at a higher number than the "5" we
picked. That works well for the single stuck bit, but perhaps doesn't
give us enough samples for the case where the error affects a range of
addresses. We should consider going to 50, or perhaps even 500 ... but
we'll need some measurements to determine the impact on the system from
taking that many CMCI interrupts and logging the larger number of errors.]
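The shape of the detector under discussion is roughly the following (a
standalone sketch, not the actual mce_intel.c logic - the threshold constant
and struct names here are placeholders pending those measurements):

```c
/* Placeholder threshold: "5" today; 50 or 500 are the candidates
 * being discussed above. */
#define CMCI_STORM_THRESHOLD 50

struct cmci_state {
	unsigned int count;	/* CMCIs seen so far */
	int storm;		/* nonzero => throttled to polling */
};

/* Called from the (simulated) CMCI handler. Returns 1 on the call
 * that crosses the threshold, i.e. when we would disable CMCI and
 * fall back to a poll timer. */
static int cmci_record(struct cmci_state *s)
{
	if (s->storm)
		return 0;	/* already throttled: interrupt is off */
	if (++s->count >= CMCI_STORM_THRESHOLD) {
		s->storm = 1;	/* would disable CMCI, start polling */
		return 1;
	}
	return 0;
}
```

Raising the threshold only changes the constant; the cost to measure is the
extra interrupts and log records taken before the switch to polling.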

The problem case is if you are unlucky enough to have two different
failures at the same time. One with storm like properties, the other
with some very modest rate of reporting. This is where early filtering
might hurt you ... diagnosis might miss the trickle of errors hidden by
the noise of the storm. So in this case we might throttle the errors,
deal with the source of the storm, and then die because we missed the
early warning signs from the trickle. But this scenario requires a lot
of rare things to happen all at the same time:
- Two unrelated errors, with specific characteristics
- The quieter error to be completely swamped by the storm
- The quieter error to escalate to fatal in a really short period (before
we can turn off filtering after silencing the source of the storm).
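One mitigation worth noting for that scenario: throttle per error source
rather than globally, so the storm address exhausts its own budget while the
trickle from a different address keeps getting logged. A toy sketch (my own
illustration, not existing kernel code - bucket count, cap, and hash constant
are arbitrary):

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64
#define PER_ADDR_CAP 100	/* reports allowed per address */

struct throttle {
	uint64_t addr[NBUCKETS];
	unsigned int cnt[NBUCKETS];
};

/* Return 1 if this report should still be logged. A hash collision
 * resets the bucket, which errs on the side of logging. */
static int throttle_allow(struct throttle *t, uint64_t addr)
{
	size_t b = (addr * 0x9E3779B97F4A7C15ULL >> 58) % NBUCKETS;

	if (t->addr[b] != addr) {	/* new (or colliding) source */
		t->addr[b] = addr;
		t->cnt[b] = 0;
	}
	return t->cnt[b]++ < PER_ADDR_CAP;
}
```

This doesn't help if the quiet error really does escalate before we silence
the storm, but it removes the "swamped by one loud source" failure mode.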

I think this is at least as good as trying to capture every error. If we
did try to capture everything, we'd be so swamped by the logging that we
also might not get around to solving the storm problem before our quiet
killer escalates.

Do you have other scenarios where you think we can do better if we log
tens of thousands or hundreds of thousands of errors in order to diagnose
the source(s) of the problem(s)?

-Tony