Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors

From: Borislav Petkov
Date: Tue Jun 16 2020 - 15:30:04 EST


On Mon, Jun 15, 2020 at 11:40:56AM -0700, Tony Luck wrote:
> From: Youquan Song <youquan.song@xxxxxxxxx>
>
> Skylake has a mode where the system administrator can use a BIOS setup
> option to request that the memory controller report uncorrected errors
> found by the patrol scrubber as corrected. This results in them being
> signalled using CMCI, which is less disruptive than a machine check.
>
> Add a quirk to detect that a "corrected" error is actually a downgraded
> uncorrected error with model specific checks for the "MSCOD" signature in
> MCi_STATUS and that the error was reported from a memory controller bank.
>
> Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
> the affected page offline.
>
> [Tony: Wordsmith commit comment]
>
> Signed-off-by: Youquan Song <youquan.song@xxxxxxxxx>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> ---
> arch/x86/kernel/cpu/mce/core.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index e9265e2f28c9..0dbd0a21a0bf 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -123,6 +123,8 @@ static struct irq_work mce_irq_work;
>
> static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
>
> +static void no_adjust_mce_log(struct mce *m) {};
> +static void (*adjust_mce_log)(struct mce *m) = no_adjust_mce_log;
> /*
> * CPU/chipset specific EDAC code can register a notifier call here to print
> * MCE errors in a human-readable form.
> @@ -772,6 +774,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
> goto clear_it;
>
> + adjust_mce_log(&m);
> mce_log(&m);

Two things: can that error type be detected when #MC gets raised, i.e., in
do_machine_check() as part of scanning all banks?

If so, then the adjusting needs to happen inside mce_log().

Also, that assignment to the function pointer doesn't make much sense to
me and I think you should do the vendor/family/model checking straight
in a function adjust_mce_log() which gets called by whoever...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette