RE: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Luck, Tony
Date: Thu May 31 2012 - 16:52:48 EST


> It could be very quiet (i.e., machine runs with no errors) and it could
> have bursts where it reports a large number of errors back-to-back
> depending on access patterns, DIMM health, temperature, sea level and at
> least a bunch more factors.

Yes - the normal case is a few errors from stray neutrons ... perhaps
a few per month, maybe on a very big system a few per hour. When something
breaks, especially if it affects a wide range of memory addresses, then
you will see a storm of errors.

> So I can imagine buffers filling up suddenly and fast, and userspace
> having hard time consuming them in a timely manner.

But I'm wondering what agent is going to be reporting all these
errors. Intel has CMCI - so you can get a storm of interrupts
which would each generate a trace record ... but we are working
on a patch to turn off CMCI if a storm is detected. AMD doesn't
have CMCI, so errors just report from polling - and we have a
maximum poll rate which is quite low by trace standards (even
when multiplied by NR_CPUS).

Will EDAC drivers loop over some chipset registers, blasting
out huge numbers of trace records? That seems just as bad
for system throughput as a CMCI storm. And just as useless.

General principle: If there are very few errors happening then
it is important to log every single one of them. If there are
so many that we can't keep up, then we must sample at some level,
and we might as well do that at generation point.
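
A minimal sketch of that principle, assuming a hypothetical helper that
the reporting code consults before it emits a record (the struct, the
function and all parameters are invented for illustration and are not
part of any existing kernel API): the first `burst` errors in each
window are always logged, and beyond that only one in `sample_every`
gets through, carrying a count of how many were skipped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state, one per error source. */
struct err_sampler {
    uint64_t window_start;  /* tick at which the current window began */
    uint64_t window_len;    /* window length in ticks */
    uint32_t burst;         /* errors per window that are always logged */
    uint32_t sample_every;  /* past the burst, log 1 in N (must be nonzero) */
    uint32_t seen;          /* errors seen in the current window */
    uint64_t dropped;       /* errors skipped since the last logged one */
};

/*
 * Returns true if this error should generate a record. On true,
 * *skipped holds the number of errors dropped since the previous
 * record, so the consumer can still estimate the real error rate.
 */
static bool err_sampler_should_log(struct err_sampler *s, uint64_t now,
                                   uint64_t *skipped)
{
    /* Start a fresh window once the old one has expired. */
    if (now - s->window_start >= s->window_len) {
        s->window_start = now;
        s->seen = 0;
    }
    s->seen++;

    /* Past the burst allowance: keep only every sample_every-th error. */
    if (s->seen > s->burst && (s->seen - s->burst) % s->sample_every != 0) {
        s->dropped++;
        return false;
    }

    *skipped = s->dropped;
    s->dropped = 0;
    return true;
}
```

With burst = 4 and sample_every = 10, a quiet system logs every error,
while a storm of 24 errors inside one window produces 6 records: the
first 4, then 2 sampled ones that each report 9 skipped errors.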

-Tony