Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

From: josh
Date: Fri Mar 13 2015 - 17:45:04 EST


On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > >
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting. However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > >
> > > in 9.x/10.x:
> > > https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
>
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually
> exactly what I need in userspace.

Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.

> The only difference between pdfork() and and
> my proposed forkfd() is where the PID and where the file descriptor are
> returned (meaning, which is optional and which isn't).
>
> Josh and I opted to return the file descriptor in the regular return value in
> forkfd and in clone4 because getting the file descriptor the whole objective of
> using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor
> is not optional, but the PID is.

And as long as you can get the fd, where it's returned really doesn't
matter.

> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those. I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> >
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
>
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
>
> I'm not too attached to the forkfd() interface, but I find it slightly superior
> for the reasons above.

Agreed.

> If we want the PD_DAEMON flag, it will have to translate to a clone flag, like
> CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior. And that kill-on-close behavior can come in a later patch. :)

> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> >
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> >
> > I think it's a feature to use read() rather than an additional magic
> > system call.
>
> Indeed, even if the libc provides a wrapper for you, like glibc does for
> eventfd (eventfd_read, eventfd_write).
>
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in
> the initial patch set, but it was part of the plans.
>
> > > > clone4() will never return a file descriptor in the range
> > > > 0-2 to
> > > > the caller, to avoid ambiguity with the return of 0 in the
> > > > child
> > > > process. Only the calling process will have the new
> > > > file
> > > > descriptor open; the child process will not.
> > >
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> >
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary. However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
>
> See above for more motivation on making the PID optional.
>
> As for the file descriptor range, if we need to be able to return 0, we can
> implement a magic constant to mean the child process, like the userspace
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on
> Linux, since that is neither a valid file descriptor nor a valid errno value.

I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations. If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.

> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > > process descriptor, including rusage retrieval. However, I don't think
> > >
> > > they actually implemented it:
> > > http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> >
> > That's a pretty good argument that we don't need to either, at least not
> > yet.
>
> pdwait4() can be implemented on top of read(), with the WNOHANG flag being just
> toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We
> could implement it via more ioctls to be done prior to read() if we don't want
> to add a syscall...
>
> Another alternative is to add a P_PD flag that can be passed as the first
> argument to waitid(), making the second argument a file descriptor instead of a
> PID or pgrp.

Or a flag that can be added to the options argument of wait4 to indicate
that the first argument is really a file descriptor.

> > > FreeBSD also implements fstat(2) for its process descriptors, although
> > > only a few of the fields get filled in.
> >
> > I looked at what they provide, and that seems like more of a novelty
> > than something particularly useful (since most of the stat fields aren't
> > meaningful), but if that's useful for compatibility then adding it seems
> > fine.
>
> I don't think we need to do anything: anon_inode will do it for us.
>
> If I stat an eventfd:
>
> stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943,
> st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> And just out of curiosity, in the following order: epoll, signalfd, timerfd
> and inotify:
>
> stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> (that process is systemd --user)

Interesting. What does stat on a CLONE_FD file descriptor return?

> > > > poll(2), select(2), epoll(7) (and similar)
> > > >
> > > > The file descriptor is readable (the select(2)
> > > > readfds
> > > > argument; the poll(2) POLLIN flag) if the new
> > > > process has
> > > > exited.
> > >
> > > FreeBSD uses POLLHUP here.
> >
> > That makes sense given that they provide the information via a separate
> > call rather than read. Since the CLONE_FD file descriptor uses read, it
> > needs to provide POLLIN, but I have no objection to using *both* POLLIN
> > and POLLHUP if that'd be at all useful.
>
> I think we should provide both, since we're notifying that there are things to
> be read and that the file descriptor has closed.

"closed" in the "other end of the not-quite-a-pipe" sense, sure. I'll
add that in a v2.

> > > FreeBSD has two different behaviours for close(2), depending on a flag
> > > value (PD_DAEMON). With the flag set it's roughly like this, but
> > > without PD_DAEMON a close(2) operation on the (last open) file
> > > descriptor terminates the child process.
> > >
> > > This can be quite useful, particularly for the use case where some
> > > userspace library has an FD-controlled subprocess -- if the application
> > > using the library terminates, the process descriptor is closed and so
> > > the subprocess is automatically terminated.
> >
> > That's an interesting idea. I don't think it makes sense for that to be
> > the default behavior, but if someone wanted to add an additional flag
> > to implement that behavior, that seems fine. A FreeBSD-compatible
> > pdfork could then use that flag when not passed PD_DAEMON and not use it
> > when passed PD_DAEMON.
> >
> > How does it kill the process when the last open descriptor closes?
> > SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
> > termination), and the latter blockable. There's a reason init systems
> > send TERM, then wait, then KILL.
>
> I was wondering if it shouldn't be a SIGHUP, since we're talking about a file
> descriptor closing. We could make it configurable too, but I'd rather not use
> the current CSIGNAL field -- better move to the arguments structure, just in
> case someone is passing SIGCHLD there, they should get EINVAL instead of
> silently sending SIGCHLD to the child process to ask it to terminate.

That sounds like several good reasons right there to defer "kill on
close" to a future patch, the author of which should research how
FreeBSD implements this.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/