Re: [RFC 0/3] seccomp trap to userspace

From: Christian Brauner
Date: Fri Mar 16 2018 - 12:41:01 EST


On Fri, Mar 16, 2018 at 09:01:47AM -0700, Andy Lutomirski wrote:
>
>
> > On Mar 16, 2018, at 7:47 AM, Christian Brauner <christian.brauner@xxxxxxxxxxx> wrote:
> >
> >> On Fri, Mar 16, 2018 at 12:46:55AM +0000, Andy Lutomirski wrote:
>
>
> I bet I confused everyone with a blatant typo:
>
> >>
> >> Hmm, I think we have to be very careful to avoid nasty races. I think
> >> the correct approach is to notice the signal and send a message to the
> >> listener that a signal is pending but to take no additional action.
> >> If the handler ends up completing the syscall with a successful
> >> return, we don't want to replace it with -EINTR. IOW the code looks
> >> kind of like:
> >>
> >> send_to_listener("hey I got a signal");
>
> That should be "hey I got a syscall". D'oh!

Ha ok, that's what led me to believe that listener != handler and I was
trying to make sense of this. :)

Thanks!
Christian

>
> >> wait_ret = wait_interruptible for the listener to reply;
> >> if (wait_ret == -EINTR) {
> >
> > Hm, so from the pseudo-code it looks like: The handler would inform the
> > listener that it received a signal (either from the syscall requester or
> > from somewhere else) and then wait for the listener to reply to that
> > message. This would allow the listener to decide what action it wants
> > the handler to take based on the signal, i.e. either cancel the request
> > or retry? The comment makes it sound like the handler doesn't really
> > wait on the listener when it receives a signal; it simply moves on.
>
> It keeps waiting killably but not interruptibly.
>
> > So "taking no additional action" here means that the listener, not the
> > handler, decides whether to abort?
>
> If by "handler" you mean kernel, then yes.
>
> There's no userspace syscall handler involved. From the kernel's perspective, a syscall is never still in progress when a signal handler is invoked; we only actually invoke signal handlers in prepare_exit_to_usermode() or the non-x86 equivalent and the functions it calls. While a syscall is running, the kernel might notice that a signal is pending and do one of a few things:
>
> 1. Just keep going. Not all syscalls can be interrupted.
>
> 2. Try to finish early. If a send() call has already sent some but not all data, it can stop waiting and return the number of bytes sent.
>
> 3. Abort with -EINTR.
>
> 4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user registers in a somewhat unpleasant way to pretend that the syscall never actually happened. This works for syscalls that wait with an absolute timeout, for example.
>
> 5. Set up restart_syscall() magic, rewrite regs so it looks like the user was about to call restart_syscall() when the signal happened, and abort.
>
> In all cases, the signal is dealt with afterwards. This could result in changing regs to call the handler or in simply returning.
>
> 1-3 should work fully in seccomp. The only issue is that the kernel doesn't know *which* to do, nor can the kernel force the listener to abort cleanly, so I think we have no real choice but to let the listener decide.
>
> 4 could be supported just like 1-3. 5 is awful, and I don't think we should support it for user listeners.
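
For anyone following along, the flow in the quoted pseudo-code can be
modeled in plain userspace C. This is only a sketch under stated
assumptions, not kernel code and not the proposed patch: every name in it
(listener_model, next_wait, trap_to_listener, the WAIT_* values) is
hypothetical, the waits are replaced by a scripted list of outcomes, and
the -EINTR returned on a fatal signal is a placeholder, since the thread
does not say what that path returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcomes of one wait for the listener's reply. */
enum wait_result { WAIT_REPLIED, WAIT_SIGNAL, WAIT_KILLED };

/* Hypothetical stand-in for the listener side of the conversation. */
struct listener_model {
    const enum wait_result *script; /* scripted wait outcomes, in order */
    size_t script_len;
    size_t pos;
    long reply;       /* return value the listener eventually supplies */
    int syscall_msgs; /* "hey I got a syscall" messages sent */
    int signal_msgs;  /* "a signal is pending" messages sent */
};

/* Consume the next scripted outcome; an exhausted script means "replied". */
static enum wait_result next_wait(struct listener_model *l)
{
    if (l->pos < l->script_len)
        return l->script[l->pos++];
    return WAIT_REPLIED;
}

/*
 * Model of the trapped syscall: notify the listener, wait interruptibly,
 * and on a signal tell the listener but take no other action. The
 * listener's reply is never replaced with -EINTR.
 */
static long trap_to_listener(struct listener_model *l)
{
    assert(l != NULL);

    l->syscall_msgs++; /* send_to_listener("hey I got a syscall") */

    enum wait_result r = next_wait(l); /* interruptible wait */
    if (r == WAIT_SIGNAL) {
        l->signal_msgs++; /* inform the listener exactly once */
        /* Keep waiting killably: ordinary signals no longer end the wait. */
        do {
            r = next_wait(l);
        } while (r == WAIT_SIGNAL);
    }

    if (r == WAIT_KILLED)
        return -EINTR; /* placeholder: the task is dying in this model */

    return l->reply; /* the listener decides the result */
}
```

With a script of { WAIT_SIGNAL, WAIT_SIGNAL, WAIT_REPLIED } and a reply of
42, trap_to_listener() sends one syscall message and one signal message,
then returns 42, matching the point that a successful completion should
not be turned into -EINTR.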