Re: [PATCH 2/4] x86/mce/amd: Introduce deferred error interrupt handler

From: Borislav Petkov
Date: Mon May 04 2015 - 11:47:18 EST


On Mon, May 04, 2015 at 10:29:50AM -0500, Aravind Gopalakrishnan wrote:
> >What's the family check for? for BIOSes which don't set the LVT offset
> >to 2, as they should?
> >
> >If so, we probably should say
> >
> > pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
> >
> >or similar...
>
> Yeah. I meant to provide a comment at least for this.
> Forgot to do that.
>
> I'll print out a error message as you suggested (considering we do this in
> other places like threshold setup or IBS setup..)

lvt_off_valid() does that already. Adding Robert.

> Right. I think a __log_error() is a good idea.
> Except, in amd_threshold_interrupt(), we have-
> m.misc = ((u64)high << 32) | low;
>
> which, is actually useless as we don't use m.misc anywhere in
> amd_decode_mce() or anywhere else in the decoding pipeline AFAICT.
> We only print out if 'misc' is valid and we only need status bits for that-
> ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
>
> But, more importantly, we don't setup 'm.addr' here (in
> amd_threshold_interrupt() or in amd_deferred_error_interrupt())
> Which means anytime we pass an error to be decoded from the interrupt
> handlers, we don't get any info about the error address.

So what are we reporting with a deferred error if it is not a
full-fledged MCE? We better fix that otherwise we probably shouldn't
even report those. I mean, userspace is supposed to do some error
handling based on error info but if that info's missing, we might just
as well panic right then and there, right?

> So, we can do one of these-
> 1. Remove m.misc setup in amd_threshold_interrupt() and
> rdmsrl(MSR_IA32_MCx_ADDR(bank), m.addr) before we call mce_log()
> 2. Since we have mce_read_aux() that reads misc and addr registers, we can
> move the mce_[rd|wr]msrl wrappers and mce_read_aux() into mce.h and use it
> here in mce_amd.c
>
> Thoughts?

Makes sense but you need to first check though, which registers are
valid in the hw when a threshold/deferred error happens and collect
them. Only then we can do proper recovery.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/