Re: [RFC PATCH 0/8] EDAC, mce_amd: Add a tracepoint for the decoded error

From: Borislav Petkov
Date: Thu Jul 27 2017 - 03:59:41 EST


On Thu, Jul 27, 2017 at 09:10:34AM +0200, Ingo Molnar wrote:
> Looks pretty nice to me conceptually. Do you have a couple of examples of
> real-life events that get logged? It's hard to decode it from the new tracepoint
> alone.

Here's what comes out in dmesg:

[ 932.370319] mce: [Hardware Error]: Machine check events logged
[ 932.374474] [Hardware Error]: Corrected error, no action required.
[ 932.381684] [Hardware Error]: CPU:1 (0:0:0) MC5_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00410000020f0f
[ 932.384256] [Hardware Error]: Error Addr: 0x0000000056071033 [Hardware Error]: TSC: 2703436211649
[ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
[ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

(whoops, that TSC thing should be on a new line).

and the tracepoint (TP) dumps only the last two lines:

[ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
[ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

but come to think of it, it should dump only the MC? Error line because
the last line can be easily deduced from the error code. I'll change
that.
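
To illustrate "easily deduced from the error code": a rough userspace
sketch that rebuilds that last line from the low 16 bits of the status
value. The field positions and strings are my assumptions, modeled on the
bus-error format, and checked only against the example above. The function
names are made up, and only the bus-error case is handled.

```python
# Sketch: rebuild the "cache level: ..." line from the low 16 bits of
# MCi_STATUS. Bit positions and message tables are assumptions for
# illustration; only the bus-error format is covered.

LL_MSGS = ["RESV", "L1", "L2", "L3/GEN"]   # cache level, bits 1:0
II_MSGS = ["MEM", "RESV", "IO", "GEN"]     # mem/io, bits 3:2
R4_MSGS = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # bits 7:4
TO_MSGS = ["no timeout", "timed out"]      # timeout, bit 8
PP_MSGS = ["SRC", "RES", "OBS", "GEN"]     # participating processor, bits 10:9


def is_bus_error(ec):
    """Return True if the 16-bit error code has the bus-error pattern."""
    return (ec & 0xF800) == 0x0800


def decode_bus_error(status):
    """Return the decoded line for a bus error, or None for other types."""
    ec = status & 0xFFFF
    if not is_bus_error(ec):
        return None

    r4 = (ec >> 4) & 0xF
    mem_tx = R4_MSGS[r4] if r4 < len(R4_MSGS) else "Wrong R4!"

    return "cache level: %s, mem/io: %s, mem-tx: %s, part-proc: %s (%s)" % (
        LL_MSGS[ec & 0x3],
        II_MSGS[(ec >> 2) & 0x3],
        mem_tx,
        PP_MSGS[(ec >> 9) & 0x3],
        TO_MSGS[(ec >> 8) & 0x1],
    )


if __name__ == "__main__":
    print(decode_bus_error(0xdc00410000020f0f))
```

Fed the MC5_STATUS value from the dmesg output above, this gives back the
same "cache level" line, which is why it does not need to take up space in
the TP string.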

Btw, the reason why I'm dumping only the MC? line is to keep the string
going into the TP relatively small. It is 128 bytes now. I tried dumping
the whole decoded string, but that easily overflowed 256 bytes, and 256
bytes is already a bit too much to log into the trace buffers.

So I'm concentrating only on the not-very-trivial stuff to decode.

The rest can be deduced directly from the MCi_STATUS value anyway, which
is straightforward to do in userspace. And we already dump that u64
value with trace_mce_record().
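
As a rough sketch of the userspace side, here is how the flags in the
MC5_STATUS[...] bracket can be recovered from the raw u64. The bit
positions are the architectural MCA status bits as I understand them, plus
the AMD-specific Deferred and ECC bits. Treat them, and the function name,
as assumptions for illustration.

```python
# Sketch: derive the STATUS[...] flag string from a raw MCi_STATUS value.
# Bit positions are assumptions based on the x86 MCA status layout; the
# Deferred and ECC bits are AMD-specific.

MCI_STATUS_OVER = 1 << 62      # an earlier error was overwritten
MCI_STATUS_UC = 1 << 61        # uncorrected error
MCI_STATUS_MISCV = 1 << 59     # MCi_MISC register is valid
MCI_STATUS_ADDRV = 1 << 58     # MCi_ADDR register is valid
MCI_STATUS_PCC = 1 << 57       # processor context corrupt
MCI_STATUS_DEFERRED = 1 << 44  # deferred error (AMD)


def decode_status_flags(status):
    """Return a flag string such as 'Over|CE|MiscV|-|AddrV|CECC'."""
    if status & MCI_STATUS_UC:
        severity = "UE"
    elif status & MCI_STATUS_DEFERRED:
        severity = "-"
    else:
        severity = "CE"

    fields = [
        "Over" if status & MCI_STATUS_OVER else "-",
        severity,
        "MiscV" if status & MCI_STATUS_MISCV else "-",
        "PCC" if status & MCI_STATUS_PCC else "-",
        "AddrV" if status & MCI_STATUS_ADDRV else "-",
    ]

    # Bits 46:45 are taken together: 2 means corrected ECC, anything else
    # non-zero is reported as uncorrected ECC.
    ecc = (status >> 45) & 0x3
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")

    return "|".join(fields)


if __name__ == "__main__":
    print(decode_status_flags(0xdc00410000020f0f))
```

For the status value in the example above, this gives the same flag list
that dmesg shows, so none of it has to go into the decoded string.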

So the idea is, userspace opens trace_mce_record() to get the raw MCE
data and then this second TP to get the decoded string of what that
error is.

Later, we could extend that same behavior to Intel, at least for the
common errors, so that we can dump *some* string explaining what the
error is.

Anyway, something like that is swirling in my head right now...

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--