RE: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.

From: Ghannam, Yazen
Date: Thu Nov 07 2019 - 08:47:59 EST


> -----Original Message-----
> From: linux-edac-owner@xxxxxxxxxxxxxxx <linux-edac-owner@xxxxxxxxxxxxxxx> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 5:39 AM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
>
> On Wed, Nov 06, 2019 at 08:54:17PM +0100, Borislav Petkov wrote:
> > which are also two attempts.
> >
> > Anyway, I'll queue your set and I'll try to debug that thing because it
> > is getting on my nerves slowly...
>
> Yah, the problem is that because we have:
>
> MODULE_DEVICE_TABLE(x86cpu, amd64_cpuids);
>
> it gets tried on each CPU because an uevent gets dispatched for each
> device, and each CPU is a device.
>
> That's why I see it twice on this box - it has two CPUs.
>

Okay, that's makes sense.

BTW, what do you think about loading based on PCI devices? The module
used to do this. I ask because I'm starting to see that future systems may
re-use PCI IDs, and this indicates the same level of hardware support.

> And Greg says making it attempt once per system can't be done. Unless we
> start doing hacks with sending uevents per BSP only which is too much.
> Or we can remember the previous return value of the module init function
> into edac_core but that's nasty too.
>
> I'm thinking we should simply kill this fat ecc_msg thing which is not
> very useful and be done with it:
>
> [ 5.697275] EDAC MC: Ver: 3.0.0
> [ 5.909530] EDAC amd64: F10h detected (node 0).
> [ 6.345231] EDAC amd64: Node 0: DRAM ECC disabled.
> [ 6.370815] EDAC amd64: F10h detected (node 0).
> [ 6.370929] EDAC amd64: Node 0: DRAM ECC disabled.
>
> That's probably still a bit annoying on a large machine but better than
> nothing.
>

Yeah, I agree.

> ---
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 3aeb5173e200..0738237e3f09 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
> amd64_warn("Error restoring NB MCGCTL settings!\n");
> }
>
> -/*
> - * EDAC requires that the BIOS have ECC enabled before
> - * taking over the processing of ECC errors. A command line
> - * option allows to force-enable hardware ECC later in
> - * enable_ecc_error_reporting().
> - */
> -static const char *ecc_msg =
> - "ECC disabled in the BIOS or no ECC capability, module will not load.\n"
> - " Either enable ECC checking or force module loading by setting "
> - "'ecc_enable_override'.\n"
> - " (Note that use of the override may cause unknown side effects.)\n";
> -
> static bool ecc_enabled(struct amd64_pvt *pvt)
> {
> u16 nid = pvt->mc_node_id;
> @@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
> amd64_info("Node %d: DRAM ECC %s.\n",
> nid, (ecc_en ? "enabled" : "disabled"));
>
> - if (!ecc_en || !nb_mce_en) {
> - amd64_info("%s", ecc_msg);
> + if (!ecc_en || !nb_mce_en)
> return false;
> - }
> - return true;
> + else
> + return true;
> }

Just a nit, but this else seems unnecessary right?

Thanks,
Yazen