RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Luck, Tony
Date: Wed Feb 29 2012 - 11:58:49 EST


> - severity: No real need for it. If the error is severe enough, the
> kernel handles automatically, i.e. memory poisoning and recovery. In all
> the other cases it is not severe enough.

We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
will run to pull them). But we will see both corrected error chatter and
recovered uncorrectable errors. I would like to be able to tell these apart.
Corrected errors in small doses are normal and don't require any
action beyond logging so you can see whether there are enough to cross
a threshold and cause alarm. Recovered uncorrectable errors are going
to be much rarer, and I think deserve closer scrutiny - even when there
is just one of them.
If you drop the severity field, is there some other way to make this
distinction?
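
As a rough sketch of the policy I have in mind for a consumer of these
events (the enum, struct and function names below are made up for
illustration; they are not the tracepoint's fields or any existing kernel
or daemon API):

```c
#include <stdbool.h>

/* Hypothetical severity values, assumed for this sketch only. */
enum mce_sev {
    MCE_SEV_CORRECTED,      /* hardware corrected the error */
    MCE_SEV_RECOVERED_UC    /* uncorrectable, but the kernel recovered */
};

/* Hypothetical per-system bookkeeping kept by the consumer. */
struct mce_policy {
    unsigned int ce_count;      /* corrected errors logged so far */
    unsigned int ce_threshold;  /* raise an alarm once the count reaches this */
};

/* Returns true when the event deserves someone's attention. */
static bool mce_needs_attention(struct mce_policy *p, enum mce_sev sev)
{
    if (sev == MCE_SEV_RECOVERED_UC)
        return true;            /* rare: even a single one gets scrutiny */

    p->ce_count++;              /* corrected: just log and count it */
    return p->ce_count >= p->ce_threshold;
}
```

Without something like the severity value as input, this kind of check has
nothing to branch on.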

> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> Cache error during an L1 linefill from L2, what the f*ck does the
> silkscreen label mean for such an error?! Well, nobody knows wtf it
> means!</sarcasm>

Cache error should point to a cpu socket - I'd like to have a silk
screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
system. I'd have to pull the system apart to see if those are helpful
in identifying which physical cpu is which.

-Tony