Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Borislav Petkov
Date: Wed Feb 29 2012 - 12:16:46 EST


On Wed, Feb 29, 2012 at 04:58:09PM +0000, Luck, Tony wrote:
> > - severity: No real need for it. If the error is severe enough, the
> > kernel handles it automatically, i.e. memory poisoning and recovery.
> > In all the other cases it is not severe enough.
>
> We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
> will run to pull them). But we will see both corrected error chatter and
> recovered uncorrectable errors. I would like to be able to tell these apart.
> Corrected errors in small doses are normal and don't require any
> action beyond logging so you can see whether there are enough to cross
> a threshold and cause alarm. Recovered uncorrectable errors are going
> to be much rarer, and I think deserve closer scrutiny - even when there
> is just one of them.
> If you drop the severity field, is there some other way to make this
> distinction?

Err, MCi_STATUS bits like bit 55 (Action Required) and 56 (Signaled #MC)
in your case...?
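A rough sketch of how a consumer could make that distinction from the
raw status value alone. The bit positions follow the Intel MCi_STATUS
layout (UC = 61, PCC = 57, S = 56, AR = 55); the macro, enum and
function names are made up for this illustration, and it assumes the
caller has already checked that the status is valid:

```c
#include <stdint.h>

/* Bit positions per the Intel MCi_STATUS layout; names are illustrative. */
#define EX_MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define EX_MCI_STATUS_PCC  (1ULL << 57)  /* processor context corrupt */
#define EX_MCI_STATUS_S    (1ULL << 56)  /* signaled via #MC */
#define EX_MCI_STATUS_AR   (1ULL << 55)  /* action required */

/* Hypothetical classes, not an existing kernel enum. */
enum ex_mce_class {
	EX_MCE_CORRECTED,          /* UC clear: normal corrected chatter */
	EX_MCE_UC_NO_ACTION,       /* UC set, not signaled */
	EX_MCE_UC_ACTION_OPTIONAL, /* UC and S set, AR clear */
	EX_MCE_UC_ACTION_REQUIRED, /* UC, S and AR set */
	EX_MCE_CONTEXT_CORRUPT     /* UC and PCC set: not recoverable */
};

/* Classify one MCi_STATUS value without a separate severity field. */
static enum ex_mce_class ex_classify_status(uint64_t status)
{
	if (!(status & EX_MCI_STATUS_UC))
		return EX_MCE_CORRECTED;
	if (status & EX_MCI_STATUS_PCC)
		return EX_MCE_CONTEXT_CORRUPT;
	if (!(status & EX_MCI_STATUS_S))
		return EX_MCE_UC_NO_ACTION;
	return (status & EX_MCI_STATUS_AR) ? EX_MCE_UC_ACTION_REQUIRED
					   : EX_MCE_UC_ACTION_OPTIONAL;
}
```

With something like this, a daemon could count EX_MCE_CORRECTED events
against a threshold and flag every single one of the UC classes.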

> > - silkscreen_label: <sarcasm> yeah, I'm getting, say, a Data
> > Cache error during an L1 linefill from L2, what the f*ck does the
> > silkscreen label mean for such an error?! Well, nobody knows wtf it
> > means!</sarcasm>
>
> Cache error should point to a cpu socket - I'd like to have a silk
> screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
> or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
> shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
> system. I'd have to pull the system apart to see if those are helpful
> in identifying which physical cpu is which.

First of all, the silkscreen label denotes DIMM slots in this context
AFAICT. Concerning CPU sockets, I'm not aware of a method to read out
the silkscreen labels at the CPU sockets, are you? Or am I missing
something?

IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
... 2k-1 belong to node 1, etc., where k is the number of cores on a
socket and thus we have a regular core enumeration on the box.
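Under that assumption the mapping is plain integer division. A minimal
sketch; ex_core_to_node() and its parameters are hypothetical names for
this illustration, not an existing kernel helper, and it only holds when
the enumeration really is regular:

```c
/*
 * Map a core number to its node, assuming cores are enumerated
 * contiguously with cores_per_socket (k) cores on every socket.
 * Returns -1 if cores_per_socket is zero.
 */
static int ex_core_to_node(unsigned int core, unsigned int cores_per_socket)
{
	if (cores_per_socket == 0)
		return -1;
	return (int)(core / cores_per_socket);
}
```

E.g. with k = 8, cores 0..7 map to node 0 and cores 8..15 to node 1.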

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551