Re: [PATCH v2 8/8] x86/MCE/AMD Support new memory interleaving modes during address translation

From: Yazen Ghannam
Date: Fri Sep 25 2020 - 16:15:05 EST


On Fri, Sep 25, 2020 at 09:22:31AM +0200, Borislav Petkov wrote:
> On Wed, Sep 23, 2020 at 11:25:10AM -0500, Yazen Ghannam wrote:
> > I don't remember the original reason, and I was recently asked about
> > this code living in a module. I did some looking after this ask, and I
> > found that we should be using this translation to get a proper value for
> > the memory error notifiers to use. So I think we still need to use this
> > function some way with the core code even if the EDAC interface isn't
> > used.
>
> You'd need to be more specific here, you want to bypass amd64_edac to
> decode errors? Judging by the current RAS activity coming from you guys,
> I'm thinking firmware. But then wouldn't the firmware do the decoding
> for us and then this function is not even needed?
>

The UC, NFIT, and CEC notifiers all operate on system physical
addresses. The address in the MCE record is checked by
mce_usable_address() to see if it can be used by the kernel, i.e. the
address is a system physical address. Right now, this check passes on
AMD systems if MCA_STATUS[AddrV] is set. This works for memory errors on
legacy AMD systems, since the NB MCA bank logs a physical address for
DRAM ECC errors. But this won't work on newer systems, because the UMC
MCA bank does not log a system physical address for DRAM ECC errors. So
the address provided by the hardware will need to be translated to a
physical address before the notifiers in the MCE chain can use it.

We can add support to get the physical address from firmware in some
cases. But it looks to me that we'll still need to keep updating the
translation code in the kernel to cover some platform/user
configurations. So it makes sense to me to move the functionality into a
module to make it easier to update.

The address translation needs to be done before the notfiers that need
it, and EDAC comes after all of them. There's also the case where the
EDAC interface isn't wanted, so amd64_edac will be unloaded. But the
functionality in the other notifiers are still expected to be available.
So it's more than just decoding the error like we do now with amd64_edac.
That's why I think the translation code can be in a separate module with
a notfier that runs before the others. This can do the translation once
then pass the result down to the CEC, UC, NFIT, and EDAC notifiers to
use as needed.

Thanks,
Yazen