Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

From: Luck, Tony
Date: Wed Nov 11 2015 - 16:48:17 EST


On Wed, Nov 11, 2015 at 09:41:58PM +0100, Borislav Petkov wrote:
> On Tue, Nov 10, 2015 at 01:55:46PM -0800, Luck, Tony wrote:
> > I need to add more to the motivation part of this. The people who want
> > this are playing with NVDIMMs as storage. So think of many GBytes of
> > non-volatile memory on the source end of the memcpy(). People are used
> > to disk errors just giving them a -EIO error. They'll be unhappy if an
> > NVDIMM error crashes the machine.
>
> Ah.
>
> Btw, there's no flag, by chance, somewhere in the MCA regs bunch at
> error time which says that the error is originating from NVDIMM? Because
> if there were, this patchset is moot. :)

No flag. We can search MCi_ADDR across the ranges to see whether this
was a normal RAM error or a non-volatile one. But that doesn't make this patch
moot. We still need to change the return address to go to the fixup code
instead of back to the place where we hit the error. The exception table
is a list of pairs of instruction pointers:

[Instruction-that-may-fault, Address-of-fixup-code]

In my RFC code I only have one function that can fault, and all the fixup
addresses point to the same place. But that doesn't scale to adding more
functions (like mcsafe_copy_from_user()).
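
To make the table idea concrete, here is a minimal userspace sketch of the
lookup, assuming a table sorted by faulting-instruction address. It is not
the kernel's actual layout or code: struct ex_entry, ex_search(),
ex_resume_address(), demo_table and all of the addresses are made up for
illustration. The point is that each entry carries its own fixup address,
so a second function can be added without sharing one global fixup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one exception table entry (not the kernel layout). */
struct ex_entry {
    uintptr_t insn;   /* address of an instruction that may fault */
    uintptr_t fixup;  /* address to resume at if that instruction faults */
};

/*
 * Binary search over a table sorted by ascending insn address.
 * Returns the matching entry, or NULL if ip is not listed.
 */
static const struct ex_entry *ex_search(const struct ex_entry *tbl,
                                        size_t n, uintptr_t ip)
{
    size_t lo = 0, hi = n;

    assert(tbl != NULL || n == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == ip)
            return &tbl[mid];
        if (tbl[mid].insn < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/*
 * Pick the address to return to after a fault at ip.
 * 0 means "no fixup registered", i.e. the fault is not recoverable here.
 */
static uintptr_t ex_resume_address(const struct ex_entry *tbl,
                                   size_t n, uintptr_t ip)
{
    const struct ex_entry *e = ex_search(tbl, n, ip);

    return e ? e->fixup : 0;
}

/*
 * Made-up addresses: two loads in one copy routine share a fixup,
 * while a load in a second routine has a fixup of its own.
 */
static const struct ex_entry demo_table[] = {
    { 0x1000, 0x1f00 },
    { 0x1008, 0x1f00 },
    { 0x2000, 0x2f00 },
};
```

With this table a fault at 0x1008 resumes at 0x1f00, a fault at 0x2000
resumes at 0x2f00, and a fault at an unlisted address such as 0x1004 gets
no fixup at all.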

-Tony