Re: [PATCH v13] x86, mce: Add memcpy_trap()

From: Linus Torvalds
Date: Thu Feb 25 2016 - 21:38:46 EST


On Thu, Feb 25, 2016 at 5:19 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> Then let's answer the API question instead of the implementation question.
>
> If a user program accesses a bad virtual address directly, it gets
> SIGSEGV or SIGBUS depending on the nature of the error. This is
> long-established practice. The SIGSEGV case is programmer error and
> the SIGBUS case might be an IO error.

Yes. I don't think there's much question there.

It would be a SIGBUS, and then we should fill in something sane in the
siginfo. So we'd have a si_code that makes sense.

We already have BUS_MCEERR_AR and BUS_MCEERR_AO ("action required" vs
"action optional"), and that together with si_addr is presumably
sufficient.

So I don't think the signal delivery has any questionable issues, it's
fairly obvious what the ABI already is.

> If a user program accesses a bad virtual address by passing the
> address to a syscall, it gets EFAULT. This may be programmer error or
> and underlying IO error, and the program can't tell.
>
> If a user program accesses a bad address on an NVDIMM via mmap, what
> should happen? If mmaped NVDIMM (or other DAX space) is the same as
> existing poisoned memory, the program gets SIGBUS. This still makes
> sense.

Agreed.

> The question here: what happens if a program accesses a bad NVDIMM
> address by passing a pointer to a syscall? With Tony's patches as
> written, I think the program gets SIGBUS via memory_failure. Do we
> want that behavior?

I do think we want that behavior.

The traditional UNIX EFAULT behavior is an odd case, and the fact that
we do *not* send a SIGSEGV for faults that the kernel notices is
strange.

In fact, if you read POSIX, you'll notice that the standard often says
that it's implementation-defined, and the reason is that a lot of
system calls aren't "native" - they are wrappers in various libraries,
and depending on just how things work, the actual access to a bad
address may be done by the kernel (-EFAULT) or by the library wrapper
(SIGSEGV).

So the EFAULT behavior is actually pretty nasty. I don't think we
should necessarily use that as "this is how things should be done".

So I'd much rather say that "if you get a machine check on a user
address, we'll always queue a SIGBUS". Obviously, if it happens in a
system call, the system call itself will also return an error. And
also obviously, there's a limit to the queueing, so if you take
*multiple* machine checks in one system call, maybe you'd get just one
signal.

> If we take your suggestion and change only the
> error code, then the program will *not* get SIGBUS. Instead it will
> get -EFAULT or -ESOMETHINGELSE. Is that okay? If it is, then
> everything is straightforward and nothing in my previous email is
> relevant.

So I'd suggest that the *signal* be sent by the machine check handler,
and that it be done independently of the system call error we choose.

And in fact, if user mode is happy with that, it would certainly be
easier for us to just always return -EFAULT for the error, and a user
app that cares about machine checks will do all the error handling in
their signal handler.

There would be advantages to that not just for kernel simplicity: if
we start using a *new* error code, existing programs might be
confused. But that's a pretty small advantage.

And there are certainly also advantages to coming up with a new system
call error return value.

My gut feel is that people would probably prefer something other than
EFAULT. Especially if we already have an error code that libraries
already have a reasonable error string for. The extra complexity in
the kernel to add another error case isn't _that_ big.

Of course, if users actually come and say "we don't care at all about
the error code if we always get a signal, and we want the signal
anyway in order to get the address that is bad", then we just
shouldn't bother.

So I don't really have any very strong opinions there.

My strong opinions here really seem to be limited to "I don't think it
makes sense to have a special magical copy_to/from_user()
implementation".

[ Side note: *within* the confines of purely the kernel, and
libnvdimm, I do think that something like "memcpy_fault()" absolutely
makes sense.

So to me, "memcpy_fault()" is not a bad idea: that's a "this is
explicitly a recoverable memory copy operation within the kernel", and
that's a completely separate thing from "copy_to/from_user()" that is
also recoverable.

What a "memcpy_fault()" (or whatever it would be called) means is
that the kernel is doing its own copies, but knows that there is some
fragility involved, and wants to have a recovery mechanism that isn't
"oops, we got a machine check in the kernel, now we need to kill the
machine".

That "memcpy_fault()" would just use the same exception handling
mechanism to not make it a fatal error. We already do that for
possible speculative page-crossers in kernel memory for
"get_unaligned_zeropad()", for example, which is a pure in-kernel
fetch that we just know might fault.

So it's literally just "copy_to/from_user()" that I don't think
should have special cases outside of perhaps error returns ]

Linus