Re: [RFC PATCH for 4.18 1/2] rseq: validate rseq_cs fields are < TASK_SIZE
From: Andy Lutomirski
Date: Fri Jun 29 2018 - 12:08:02 EST
On Fri, Jun 29, 2018 at 8:27 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> On Fri, Jun 29, 2018, 08:03 Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>>
>> Considering those inconsistencies between architectures (either
>> the task gets killed, or the top bits are silently cleared), I'm
>> very much tempted to be restrictive in the inputs accepted by
>> rseq, and not rely on architectures as providing consistent
>> validation of the return IP.
>>
>> Thoughts ?
>
>
> Then you need to make it a compat system call, since clearly you and Andy
> want the 32-bit case to do something different from the 64-bit case.
I personally would like the compat and non-compat cases to do exactly
the same thing. If abort_ip is the address (as a u64) of a valid
executable instruction and an abort happens, then that instruction
should get executed. If abort_ip does not point to user-executable
memory, then the process should get a signal.
The problem isn't with rseq per se -- it's with the daft way that the
x86 return-to-userspace instructions work. If I apply this patch:
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..26e4ba44e87b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -346,6 +346,8 @@ __visible void do_int80_syscall_32(struct pt_regs *regs)
{
enter_from_user_mode();
local_irq_enable();
+ if (!user_64bit_mode(regs))
+ regs->ip |= (1UL << 32);
do_syscall_32_irqs_on(regs);
}
the kernel *still works*. But this unconditionally uses the IRET
path, and I don't even want to speculate as to what the hell happens
if we exit using SYSRETL, or LRET, or SYSEXITL, or if Intel or AMD
ever gives us a new mode that gets rid of the espfix shite. IOW, I
don't think that the x86 entry code should make a promise that it will
continue ignoring the high bits of regs->ip when
!user_64bit_mode(regs) on a 64-bit kernel.
The problem with rseq as it stands is that this oddity gets
accidentally exposed all the way to userspace. If we're not careful,
it'll be possible for a slightly buggy user program to goof up the
code that generates the data structure that supplies abort_ip such
that the high bits are garbage (0xcccccccc due to padding, for
example), and the kernel will do exactly what the user code requested,
and we'll get regs->ip = 0xcccccccc00000000 | (the actual intended
abort_ip), and we'll pass that crap value all the way to IRET, and
IRET will truncate it back down to the correct value. And then some
CPU will add new behavior or we'll invoke SYSRETL or whatever on some
weird CPU, and the program will crash. And we'll be sad.
I suppose we could handle this in the entry code by coming up with a
way to reject out-of-bounds regs->ip for 32-bit tasks, but that's
going to be a bit messy and will slow down normal code that doesn't
use rseq.
Other than rseq, I don't think that there's any real issue. The only
ways to get regs->ip >= 2^32 in a 32-bit task involve ptrace or manual
fiddling with signal contexts, and I don't expect to ever have any
real software depend on precisely what happens.