RE: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models

From: Ghannam, Yazen
Date: Mon Mar 11 2019 - 14:52:49 EST


> -----Original Message-----
> From: linux-edac-owner@xxxxxxxxxxxxxxx <linux-edac-owner@xxxxxxxxxxxxxxx> On Behalf Of Borislav Petkov
> Sent: Monday, March 11, 2019 1:21 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; Borislav Petkov <bp@xxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; x86@xxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; rafal@xxxxxxxxxx; clemej@xxxxxxxxx
> Subject: Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
>
> On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> > +static bool smca_filter_mce(struct mce *m)
> > +{
> > +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> > +	struct cpuinfo_x86 *c = &boot_cpu_data;
> > +	u8 xec = XEC(m->status, xec_mask);
> > +
> > +	/*
> > +	 * Spurious errors of this type may be reported.
> > +	 * See Family 17h Models 10h-2Fh Erratum #1114.
> > +	 */
> > +	if (c->x86 == 0x17 &&
> > +	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> > +	    bank_type == SMCA_IF && xec == 10)
> > +		return true;
>
> This is happening too late and we need it much earlier, from Rafal's dmesg:
>
> [ 1.070855] mce: [Hardware Error]: Machine check events logged
> [ 1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
> [ 1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
> [ 1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b
>
> that's __print_mce() from the notifier.
>
> So we'd need a filter function which is called in do_machine_check() and
> machine_check_poll() right after we've collected enough info to be able
> to filter out the MCE based on the signature. In this case the extended
> error code and SMCA bank type suffice but we should put those functions
> late enough so that they can be used for other filtering later.
>

Okay, understood.

Should I keep the filter in edac_mce_amd? I guess it's not necessary if the error is already filtered out earlier.
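
For illustration only, here is a rough userspace sketch of the early-filter idea discussed above: check the signature once, before a record is logged, so that nothing downstream ever sees it. This is not the actual patch. The struct and function names (mce_sketch, cpu_sketch, filter_mce_sketch, poll_sketch) are hypothetical stand-ins, the bank-to-type mapping is an assumption taken from the dmesg above (Bank 1 reporting as IF), and the 0x3f extended error code mask is likewise an assumption for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum bank_type_sketch { BANK_IF_SKETCH, BANK_OTHER_SKETCH };

struct mce_sketch {
	uint64_t status;	/* MCA_STATUS value */
	unsigned int bank;	/* MCA bank number */
};

struct cpu_sketch {
	unsigned int family;
	unsigned int model;
};

/* Extended error code: bits starting at 16 of MCA_STATUS, masked. */
#define XEC_SKETCH(x, mask)	(((x) >> 16) & (mask))

/* Assumption for this sketch only: bank 1 is the IF bank. */
static enum bank_type_sketch get_bank_type_sketch(unsigned int bank)
{
	return bank == 1 ? BANK_IF_SKETCH : BANK_OTHER_SKETCH;
}

/* Return true if the record matches the Erratum #1114 signature. */
static bool filter_mce_sketch(const struct cpu_sketch *c,
			      const struct mce_sketch *m)
{
	uint8_t xec = XEC_SKETCH(m->status, 0x3f);

	return c->family == 0x17 &&
	       c->model >= 0x10 && c->model <= 0x2f &&
	       get_bank_type_sketch(m->bank) == BANK_IF_SKETCH &&
	       xec == 10;
}

/*
 * Stand-in for a polling/handler path: the filter runs before anything
 * is logged. Returns the number of records that would be logged.
 */
static int poll_sketch(const struct cpu_sketch *c,
		       const struct mce_sketch *records, int n)
{
	int logged = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (filter_mce_sketch(c, &records[i]))
			continue;	/* dropped early, never logged */
		logged++;
	}

	return logged;
}
```

With the status value from the dmesg above (d8200000000a0151 on Bank 1, Family 17h Model 11h), the sketch treats the record as spurious and never counts it as logged. If the real filter sits this early, a second copy of the check in the decoder would have nothing left to match.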

> Alternatively, if this error type has a special bit in the mask registers so
> that you can disable it there ala
>
> 	if (c->x86_vendor == X86_VENDOR_AMD) {
> 		if (c->x86 == 15 && cfg->banks > 4) {
> 			/*
> 			 * disable GART TBL walk error reporting, which
> 			 * trips off incorrectly with the IOMMU & 3ware
> 			 * & Cerberus:
> 			 */
> 			clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
>
>
> that would be even better but I'd guess it doesn't have a special bit...
>

Yes, that's right. Clearing a bit in MCA_CTL is not recommended in this case.

Thanks,
Yazen