Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

From: Borislav Petkov
Date: Wed Oct 08 2014 - 18:58:07 EST


On Wed, Oct 08, 2014 at 04:52:06PM -0500, Aravind Gopalakrishnan wrote:
> I am not understanding why m.bank is assigned this value..

That's a very good question, see below for some history.

>
> It only causes incorrect decoding-
> [ 608.832916] DEBUG: raise_amd_threshold_event
> [ 608.832926] [Hardware Error]: Corrected error, no action required.
> [ 608.833143] [Hardware Error]: CPU:26 (15:2:0)
> MC165_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
> [ 608.833551] [Hardware Error]: MC165_ADDR: 0x0000000000000000
> [ 608.833777] [Hardware Error]: cache level: RESV, tx: INSN
> [ 608.834034] amd_inject module loaded ...
>
>
> (Obviously, as in amd_decode_mce() we switch (m->bank) for decoding the
> status and there is no bank 165)
>
> OTOH, if m.bank = bank;
> Then we get correct decoding info-

Yes, and I think we should do that only if we're using the *last* error
to report the overflow with: we're reporting a thresholding counter
overflow and the bank on which it was detected on should, of course, be
part of the report.

The "funny" bank is some sort of a software defined banks thing which
got added in 2005 (see the patch I dug out below) and it was supposed
to be used (I'm guessing here) for reporting thermal events using MCA
(dumb idea, if you ask me) so since thermal events don't really have
a bank, they decided to have some sort of a software-defined MCA bank
which doesn't correspond to any hardware bank.

Then Jacob decided to use it for some reason too:

95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")

maybe because thresholding errors don't have a bank associated with them
but if I'm not missing anything, they do!

Oh oh, ok, it just dawned on me! I think I know what it *might* have
been: they wanted to report the overflowing with a special error
signature which uses a software-defined bank. Ok, that actually makes
sense: when you see an error for a sw-defined bank, you're reporting an
thresholding counter overflow.

Which means that we shouldn't be populating m.status either, i.e. what
we did earlier:

rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);

because this is a special error type.

Hmm, it is too late here to think straight, more tomorrow. But Aravind,
that was a very good question, you actually made me dig into git history
:-)

Good night.