Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Mauro Carvalho Chehab
Date: Wed Feb 29 2012 - 12:17:40 EST


On 29-02-2012 11:40, Borislav Petkov wrote:
> On Wed, Feb 29, 2012 at 11:04:09AM -0300, Mauro Carvalho Chehab wrote:
>> No, you didn't. Every time i touch on this point, you just say that it
>> doesn't fit without giving any explanation why not.
>
> Let me explain it to you one _last_ time:

Thanks! Your view is now clear.
>
> - severity: No real need for it. If the error is severe enough, the
> kernel handles automatically, i.e. memory poisoning and recovery. In all
> the other cases it is not severe enough.

I see your point, but it is one thing to recover from severe errors;
it is another thing to properly report them.

In general, when errors occur, users count them, and take other
measures (like replacing the affected hardware) only if the error
count is above a certain threshold.

The threshold criteria for non-severe errors are generally different from
the criteria for severe ones. That's why the severity information should
be reported.
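
As a rough sketch of the kind of userspace accounting I mean (the severity
classes, the struct, and the threshold values below are all hypothetical,
made up for illustration, and are not existing kernel or tool code):

```c
#include <stdbool.h>

/* Hypothetical severity classes; not the kernel's actual MCE severities. */
enum err_severity { ERR_CORRECTED, ERR_UNCORRECTED, ERR_FATAL, ERR_SEV_MAX };

/* Per-component error counters, one slot per severity class. */
struct fru_stats {
	unsigned int count[ERR_SEV_MAX];
};

/*
 * Illustrative thresholds only: many corrected errors are tolerated,
 * a single uncorrected or fatal error is not.
 */
static const unsigned int replace_threshold[ERR_SEV_MAX] = {
	[ERR_CORRECTED]   = 100,
	[ERR_UNCORRECTED] = 1,
	[ERR_FATAL]       = 1,
};

/* Account one error; return true once the component should be replaced. */
static bool account_error(struct fru_stats *st, enum err_severity sev)
{
	if ((unsigned int)sev >= ERR_SEV_MAX)
		return false;
	st->count[sev]++;
	return st->count[sev] >= replace_threshold[sev];
}
```

Without the severity in the event, such a tool can only keep one counter
per component and has to apply a single threshold to everything.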

>
> - location: this is contained in the ->cpu field.

Assuming that the CPU field contains the location of the error is a bad
assumption. On SB, the CPU that reports a memory error is the CPU where
some code tried to access the RAM, and not the CPU where the memory
controller is. So, the value for CPU is bogus for those error types.
This is also true, at least on Intel, for other types of errors, like
bus/interconnect ones: the CPU that reports the error is the one trying
to access the bus.

Also, in almost all memory error cases, the location of the affected
component is not the memory controller at the CPU. Instead, it is the
DRAM chip, located inside a DIMM.

So, while there are several error types where the location is given by the
cpu field, there are also several other cases where location != cpu.

The kernel decoder knows the error location in most cases. So, instead
of leaving userspace to guess the error location, it should report
what it decoded.
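
To illustrate, here is a minimal sketch of a report that carries the decoded
location as its own field, next to the reporting CPU. The struct, its field
names, and the output format are assumptions made up for this example; they
are not the fields of the actual MCE tracepoint:

```c
#include <stdio.h>

/*
 * Hypothetical decoded-error record; field names are illustrative and are
 * not the fields of the actual MCE tracepoint.
 */
struct decoded_err {
	int reporting_cpu;	/* CPU that logged the error */
	const char *location;	/* decoded location, or NULL if not decoded */
};

/*
 * Format a report line. The location is emitted as its own field, so it is
 * never inferred from the reporting CPU. Returns snprintf()'s result.
 */
static int format_err(char *buf, size_t len, const struct decoded_err *e)
{
	return snprintf(buf, len, "cpu=%d location=\"%s\"",
			e->reporting_cpu,
			e->location ? e->location : "unknown");
}
```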

> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> Cache error during an L1 linefill from L2, what the f*ck does the
> silkscreen label mean for such an error?! Well, nobody knows wtf it
> means!</sarcasm>

It identifies which component needs to be replaced, because there are too
many errors there, and it is likely damaged.

Silkscreen label for an L1 error: CPU0, CPU1, CPU2, CPU3 (the socket "name"
for the CPU socket on the motherboard, as labeled on the silkscreen).
Again, this is not the CPU field, as one CPU socket has several cores,
and the core/socket ID order can be different from the actual CPU slots
(btw, this is explicitly noted in a few Intel datasheets).

Of course, for both location and silkscreen label fields, if, for any
reason, the location of the affected component can't be identified, those
fields should be filled with an empty string (or with something like "unknown").

Silkscreen label for a memory error: DIMM1A, DIMM2B, etc.
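
A minimal sketch of such a lookup, assuming a board-specific table (the
table contents and the function name are hypothetical; the ordering is
deliberately made up to show that socket IDs need not follow the silkscreen
numbering):

```c
#include <stddef.h>

/*
 * Hypothetical board-specific table mapping a socket ID to the label
 * printed on the motherboard. The ordering is invented for illustration.
 */
static const char *const socket_label[] = { "CPU1", "CPU0", "CPU3", "CPU2" };

/* Return the silkscreen label for a socket, or "unknown" if there is none. */
static const char *silkscreen_label(int socket_id)
{
	size_t n = sizeof(socket_label) / sizeof(socket_label[0]);

	if (socket_id < 0 || (size_t)socket_id >= n)
		return "unknown";
	return socket_label[socket_id];
}
```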

> - error_msg: already there in my patch.
>
> So go and read and try _understanding_ this before you come back with
> more crap, ok?

Ok, but please do the same and try to see the question from the
perspective of a user who needs to know which damn FRU
(Field Replaceable Unit) is broken and needs a replacement on their
critical systems.
>
>> Running away from this discussion won't help at all.
>
> Not running away - trying not to waste time with bullshit.
>

Thanks,
Mauro