Re: [RCF] Linux memory error handling

From: Ross Biro
Date: Wed Jun 15 2005 - 20:04:56 EST


On 6/15/05, Maciej W. Rozycki <macro@xxxxxxxxxxxxxx> wrote:
> On Wed, 15 Jun 2005, Russ Anderson wrote:
>
> >
> > Polling Threshold: A solid single bit error can cause a burst
> > of correctable errors that can cause a significant logging
> > overhead. SBE thresholding counts the number of SBEs for
> > a given page and if too many SBEs are detected in a given
> > period of time, the interrupt is disabled and instead
> > linux periodically polls for corrected errors.
>
> This is highly undesirable if the same interrupt is used for MBEs. A
> page that causes an excessive number of SBEs should rather be removed from
> the available pool instead. Logging should probably take recent events
> into account anyway and take care of not overloading the system, e.g. by
> keeping only statistical data instead of detailed information about each
> event under load.
>

First, SBE and MBE are historical names; these are now called
correctable and uncorrectable errors. Modern chip sets can often
correct many incorrect bits in a single word. So please don't assume
you can infer anything about the probability of an MBE from the fact
that you are seeing SBEs. Any such inference would need to be chip
set specific.

Some common chip sets have bugs that can cause an excessive number of
reported SBEs. On those chip sets, even without any error reporting,
there is a noticeable performance hit when the SBE counters go wild.
If every SBE generated an interrupt, the system would grind to a halt.
So there need to be easy ways to disable the interrupts associated
with SBEs.
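
A minimal sketch of that kind of thresholding, in plain userspace C
rather than kernel code. Every name and number here (sbe_throttle,
sbe_record, the window and limit values) is hypothetical and chosen
only for illustration; real limits would have to be chip set specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits; real values would be chip set specific. */
#define SBE_WINDOW_SECS  10   /* length of the counting window */
#define SBE_MAX_PER_WIN  100  /* SBEs tolerated per window */

struct sbe_throttle {
    uint64_t window_start;  /* time the current window began, in seconds */
    unsigned count;         /* SBEs seen in the current window */
    bool     irq_enabled;   /* false once we have fallen back to polling */
};

static void sbe_throttle_init(struct sbe_throttle *t, uint64_t now)
{
    t->window_start = now;
    t->count = 0;
    t->irq_enabled = true;
}

/*
 * Record one corrected error at time 'now' (seconds).
 * Returns true if the caller should keep the SBE interrupt enabled,
 * false if it should mask the interrupt and poll instead.
 * The fallback is sticky: only sbe_throttle_init() re-enables it.
 */
static bool sbe_record(struct sbe_throttle *t, uint64_t now)
{
    /* Start a fresh window once the old one has expired. */
    if (now - t->window_start >= SBE_WINDOW_SECS) {
        t->window_start = now;
        t->count = 0;
    }

    t->count++;
    if (t->count > SBE_MAX_PER_WIN)
        t->irq_enabled = false;

    return t->irq_enabled;
}
```

The fallback is deliberately sticky here, so a chip set that storms
in bursts does not flip back and forth between interrupts and polling.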

Also, some memory/chip set combinations generate a significant number
of SBEs without any significant danger of an MBE, so many people will
want to ignore SBEs entirely, or only poll once in a while.

Finally, many chip sets have memory scrubbing technology that can
simultaneously generate SBEs in memory not being accessed by the
kernel and fix those errors. So don't just assume that because the
kernel isn't allowing access to a page, you won't see SBEs or MBEs
from that page.
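
As a rough illustration, an error handler could treat "error on a page
the kernel has already retired" as an expected case instead of a bug.
The states, actions and policy below are all hypothetical, just one
possible arrangement and not taken from any existing code.

```c
/* Hypothetical page states, error classes and handler actions. */
enum page_state { PAGE_IN_USE, PAGE_FREE, PAGE_RETIRED };
enum mem_err    { ERR_CORRECTABLE, ERR_UNCORRECTABLE };
enum action     { ACT_LOG_ONLY, ACT_RETIRE_PAGE, ACT_KILL_USERS };

/*
 * Decide what to do about an error reported against a page.
 * The scrubber can report errors on any page, including retired ones,
 * so PAGE_RETIRED must be handled rather than assumed impossible.
 */
static enum action handle_mem_error(enum page_state st, enum mem_err err)
{
    /* Nobody depends on the contents of a retired page: just log it. */
    if (st == PAGE_RETIRED)
        return ACT_LOG_ONLY;

    /* The hardware already fixed the data; log (or count) and move on. */
    if (err == ERR_CORRECTABLE)
        return ACT_LOG_ONLY;

    /* Uncorrectable error in a free page: take it out of the pool. */
    if (st == PAGE_FREE)
        return ACT_RETIRE_PAGE;

    /* Uncorrectable error in live data: its users cannot continue. */
    return ACT_KILL_USERS;
}
```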

Otherwise, anything done in this direction seems like a good idea to me.

Ross