Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

From: Florian Weimer
Date: Tue Apr 30 2019 - 13:07:51 EST


* Linus Torvalds:

> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with s separate
>> stack. You can basically do this:
>>
>> - allocate new stack for the child
>> - in trivial asm wrapper, do:
>> - push the callback address on the child stack
>> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>> - "ret"
>> - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd.
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly. On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.

We already use clone. I don't know why. We should add a comment that
provides the reason.

> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?

I think what happens with these emulators that they use fork (no shared
address space) but suspend the parent thread. clone with CLONE_VFORK
definitely does not fail. That mostly works, except that you do not get
back the error code from the execve. Instead, the process is considered
launched, and the caller collects the exit status from the _exit after
the failed execve.

> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
> - allocate new stack for the child
> - in trivial asm wrapper, do:
> - push the callback address on the child stack
> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
> - "ret"
> - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.
>
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.

As far as we understand the situation, we believe that we absolutely
must block all signals for both the parent thread and the new
subprocess. Signals can be unblocked in the subprocess, but only after
setting their handlers to SIG_DFL or SIG_IGN. (Parent signal handlers
cannot run in the subprocess because application-supplied signal
handlers generally do not expect to run with a corrupt TCBâor a
different PID.)

At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.

Thanks,
Florian