Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

From: Jann Horn
Date: Mon Apr 29 2019 - 15:30:34 EST


On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton <kevin@xxxxxxxxxxx> wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult <lkml@xxxxxxxxx> wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry for hijacking this thread, but I'm curious about what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solvable. I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc. The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.
>
> Then when it's all set up you pr_execve() to kick it off.

Is this really necessary? I agree that fork()+exec() is suboptimal,
but if you just want to avoid the cost of duplicating the address
space, you can AFAICS already do that in userspace with
clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
the parent can block on a futex until the child leaves the mm_struct
through execve() (or by exiting, in the case of an error), and the
child can temporarily have its stack at the bottom of the caller's
stack. You could build an API like this around it in userspace:

  int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
      <clone flags and arguments, maybe in a struct>)

and then you'd use it like this to fork off a child process:

  int spawn_shell_subprocess_(void *arg) {
    char *cmdline = arg;
    /* execl() needs a terminating null pointer after the last argument */
    execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
    return -1;
  }
  pid_t spawn_shell_subprocess(char *cmdline) {
    pid_t child_pid;
    int res = clone_temporary(spawn_shell_subprocess_, cmdline,
        &child_pid, [...]);
    if (res == 0) return child_pid;
    return res;
  }

clone_temporary() could be implemented roughly as follows by the libc
(or other userspace code):

  sigset_t sigset, sigset_old;
  sigfillset(&sigset);
  sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
  int child_tid; /* futex word; named so it does not shadow the
                    child_pid out-parameter */
  int result = 0;
  /* starting here, use inline assembly to ensure that no stack
     allocations occur */
  long child = syscall(__NR_clone,
      CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD,
      $RSP - ABI_STACK_REDZONE_SIZE, NULL, &child_tid, 0);
  if (child == -1) { result = -1; goto reset_sigmask; }
  if (child == 0) {
    /* child: the mm is shared, so the parent sees this store */
    result = fn(arg);
    syscall(__NR_exit, 0);
  }
  *child_pid = child; /* report the PID to the caller */
  futex(&child_tid, FUTEX_WAIT, child, NULL);
  /* end of no-stack-allocations zone */
reset_sigmask:
  sigprocmask(SIG_SETMASK, &sigset_old, NULL);
  return result;
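
For anyone who wants to play with the idea without writing inline
assembly, here is a minimal sketch of the same technique that compiles
as plain C. It is not the design above verbatim; it makes several
simplifying assumptions. It uses the glibc clone() wrapper with a
separately mmap()ed child stack instead of running the child at the
bottom of the caller's stack. It passes CLONE_PARENT_SETTID (rather
than CLONE_CHILD_SETTID) with the same pointer, so the futex word
already holds the TID when clone() returns in the parent. It omits the
signal masking and the extra clone flags argument. The names
struct temp_child, temp_child_entry and CHILD_STACK_SIZE are made up
for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024) /* arbitrary size for this sketch */

struct temp_child {            /* hypothetical helper, not a kernel/libc type */
  int (*fn)(void *);
  void *arg;
  int result;                  /* written by the child through the shared mm */
};

static int temp_child_entry(void *p) {
  struct temp_child *tc = p;
  tc->result = tc->fn(tc->arg); /* only reached if fn() did not execve() */
  return 127;                   /* becomes the child's exit status */
}

/* Simplified variant: no clone flags argument, no signal masking. */
int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid) {
  struct temp_child tc = { .fn = fn, .arg = arg, .result = 0 };
  pid_t tid_word = 0;          /* futex word, cleared by the kernel */

  char *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return -1;

  /* glibc argument order after arg: parent_tid, tls, child_tid */
  pid_t child = clone(temp_child_entry, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
                          SIGCHLD,
                      &tc, &tid_word, NULL, &tid_word);
  if (child == -1) {
    munmap(stack, CHILD_STACK_SIZE);
    return -1;
  }

  /* Sleep until the child leaves the shared mm (execve() or exit); the
     kernel then stores 0 in tid_word and wakes the futex. */
  for (;;) {
    pid_t v = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE);
    if (v == 0)
      break;
    syscall(SYS_futex, &tid_word, FUTEX_WAIT, v, NULL, NULL, 0);
  }

  munmap(stack, CHILD_STACK_SIZE);
  *child_pid = child;
  return tc.result;
}

static int spawn_shell_subprocess_(void *arg) {
  execl("/bin/sh", "sh", "-c", (char *)arg, (char *)NULL);
  return -1;                   /* execl() only returns on failure */
}

pid_t spawn_shell_subprocess(char *cmdline) {
  pid_t child_pid;
  int res = clone_temporary(spawn_shell_subprocess_, cmdline, &child_pid);
  if (res == 0)
    return child_pid;
  if (child_pid > 0)           /* exec failed: reap the short-lived child */
    waitpid(child_pid, NULL, 0);
  return -1;
}
```

The caller still has to waitpid() on the returned PID, as with fork().
Because signals are not blocked here, a signal handler could run in the
child while it still shares the parent's memory; a real implementation
would want the sigprocmask() dance from the pseudocode above, plus
resetting the mask in the child before the exec.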