RE: [PATCH RESEND 2/5] x86/MCE: Handle MCA controls in a per_cpu way

From: Ghannam, Yazen
Date: Wed Apr 10 2019 - 12:36:34 EST


> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Tuesday, April 9, 2019 3:34 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; tony.luck@xxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH RESEND 2/5] x86/MCE: Handle MCA controls in a per_cpu way
>
> On Mon, Apr 08, 2019 at 06:55:59PM +0000, Ghannam, Yazen wrote:
> > We already have the case where some banks are not initialized either
> > due to quirks or because they are Read-as-Zero, but we don't try to
> > skip creating their files. With this full set (see patch 5), an unused
> > bank will return a control value of 0.
>
> So set_bank() is changed to do:
>
> @@ -2088,7 +2097,7 @@ static ssize_t set_bank(struct device *s, struct device_attribute *attr,
>  	if (kstrtou64(buf, 0, &new) < 0)
>  		return -EINVAL;
>
> -	if (bank >= mca_cfg.banks)
> +	if (bank >= per_cpu(num_banks, s->id))
>  		return -EINVAL;
>
>
> How would that work if the disabled/not-present bank is in the middle?
> The old example: bank3 on CPU5.
>
> > Would that be sufficient to indicate that a bank is not used?
>
> Well, it should not allow for any control bits to be set and it should
> have the proper bank number.
>

We have this case on AMD Family 17h with Bank 4. The hardware forces this bank to be Read-as-Zero/Writes-Ignored, so writes to its control register have no effect and reads return 0.

The hardware enforces this regardless of whether the bank is in the middle or at the end of the range.
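
As a rough illustration of the point above, here is a minimal userspace sketch, not kernel code. It assumes a per-CPU bank count like the per_cpu(num_banks, ...) check in the quoted hunk; everything else (NR_CPUS_MODEL, MAX_BANKS_MODEL, struct bank_model, model_init(), hw_write_ctl(), hw_read_ctl(), model_set_bank()) is a hypothetical name made up for the example. The bounds check accepts a Read-as-Zero/Writes-Ignored bank in the middle of the range, but the write is dropped and the control value still reads back as 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes for the model; not taken from the kernel. */
#define NR_CPUS_MODEL   8
#define MAX_BANKS_MODEL 8

/* Hypothetical model of one MCA bank's control register. */
struct bank_model {
	uint64_t ctl;     /* last value the "hardware" accepted */
	bool     raz_wi;  /* Read-as-Zero/Writes-Ignored bank */
};

/* Stand-in for the per-CPU bank count used by the quoted check. */
static unsigned int num_banks[NR_CPUS_MODEL];
static struct bank_model banks[NR_CPUS_MODEL][MAX_BANKS_MODEL];

/* Seven banks per CPU, with bank 4 in the middle marked RAZ/WI. */
static void model_init(void)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		num_banks[cpu] = 7;
		for (unsigned int b = 0; b < MAX_BANKS_MODEL; b++) {
			banks[cpu][b].ctl = 0;
			banks[cpu][b].raz_wi = (b == 4);
		}
	}
}

/* The "hardware" silently drops writes to a RAZ/WI bank. */
static void hw_write_ctl(unsigned int cpu, unsigned int bank, uint64_t val)
{
	if (!banks[cpu][bank].raz_wi)
		banks[cpu][bank].ctl = val;
}

/* The "hardware" always returns 0 for a RAZ/WI bank. */
static uint64_t hw_read_ctl(unsigned int cpu, unsigned int bank)
{
	return banks[cpu][bank].raz_wi ? 0 : banks[cpu][bank].ctl;
}

/* Models only the bounds check from the hunk plus the register write. */
static int model_set_bank(unsigned int cpu, unsigned int bank, uint64_t val)
{
	if (cpu >= NR_CPUS_MODEL || bank >= num_banks[cpu])
		return -EINVAL;

	hw_write_ctl(cpu, bank, val);
	return 0;
}
```

In this model, after model_init(), model_set_bank(5, 4, ~0ULL) returns 0 but hw_read_ctl(5, 4) still returns 0, while bank 3 on the same CPU keeps whatever value was written, and bank 7 is rejected with -EINVAL.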

> > But I do have a couple of thoughts:
>
> > 1) Will missing banks confuse users? As mentioned, we already have the
> > case of unused/uninitialized banks today, but we don't skip their file
> > creation. a) Will this affect any userspace tools?
>
> I guess it would be easier if we keep creating all files but denote properly
> which banks are disabled.
>

I'm thinking of redoing the sysfs interface for banks in a separate patch set. I could add a new file that indicates whether a bank is enabled or disabled, or just update the documentation to describe this case.

Thanks,
Yazen