Re: [PATCH v2] x86/mce: Add Skylake quirk for patrol scrub reported errors

From: Luck, Tony
Date: Fri Sep 25 2020 - 20:02:33 EST


On Fri, Sep 25, 2020 at 09:19:12PM +0200, Borislav Petkov wrote:
> And after staring at this a bit, it looks like all it wants to do is to
> adjust the severity. And we have a severity grading mechanism. So let's
> see how ugly it would become if we extended it to check that too.
>
> So how's that below instead?

In some ways that's pretty neat. But it would still be ugly if we need
to extend it further for other issues. Especially if they don't have such
a simple rule to adjust the severity.

> It builds here, I haven't even thought about testing it and I might've
> missed out on some aspects but tbh this looks much better to me. Because
> it is not bolted on the handling path but integral part of it.
>
> Thoughts?
>
> ---
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index e1da619add19..8c1a41aa5e40 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -9,9 +9,11 @@
> #include <linux/seq_file.h>
> #include <linux/init.h>
> #include <linux/debugfs.h>
> -#include <asm/mce.h>
> #include <linux/uaccess.h>
>
> +#include <asm/mce.h>
> +#include <asm/intel-family.h>
> +
> #include "internal.h"
>
> /*
> @@ -40,9 +42,14 @@ static struct severity {
> unsigned char context;
> unsigned char excp;
> unsigned char covered;
> + unsigned char cpu_model;
> + unsigned char cpu_stepping;
> + unsigned char bank_lo, bank_hi;

This would be better as a bit mask. I don't think we need this same
hack on the next generation of CPUs ... but if we did, the bank numbers
that would be affected don't form a continuous sequence.

> char *msg;
> } severities[] = {
> #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
> +#define BANK_RANGE(l, h) .bank_lo = l, .bank_hi = h
> +#define MODEL_STEPPING(m,s) .cpu_model = m, .cpu_stepping = s
> #define KERNEL .context = IN_KERNEL
> #define USER .context = IN_USER
> #define KERNEL_RECOV .context = IN_KERNEL_RECOV
> @@ -97,7 +104,10 @@ static struct severity {
> KEEP, "Corrected error",
> NOSER, BITCLR(MCI_STATUS_UC)
> ),
> -
> + MCESEV(AO, "UnCorrected Patrol Scrub Error",
> + NOSER, MASK(0xffffeff0, 0x001000c0),
> + MODEL_STEPPING(INTEL_FAM6_SKYLAKE_X, 4),BANK_RANGE(13,18)
> + ),

I'd need to stare at the placement of this in the sequence of rules at some
non-Friday-afternoon time. It might be right, but as we've grumbled together
many times before that code is full of surprise side effects.

> /*
> * known AO MCACODs reported via MCE or CMC:
> *
> @@ -324,6 +334,12 @@ static int mce_severity_intel(struct mce *m, int tolerant, char **msg, bool is_e
> continue;
> if (s->excp && excp != s->excp)
> continue;
> + if (s->cpu_model && boot_cpu_data.x86_model != s->cpu_model)
> + continue;
> + if (s->cpu_stepping && boot_cpu_data.x86_stepping <= s->cpu_stepping)
> + continue;
> + if (s->bank_lo && (s->bank_lo <= m->bank && m->bank <= s->bank_hi))
> + continue;
> if (msg)
> *msg = s->msg;
> s->covered = 1;

-Tony