Re: pidfd design

From: Daniel Colascione
Date: Wed Mar 20 2019 - 14:38:51 EST


On Wed, Mar 20, 2019 at 11:26 AM Christian Brauner <christian@xxxxxxxxxx> wrote:
> On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> >
> >
> > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione <dancol@xxxxxxxxxx> wrote:
> > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > ><christian@xxxxxxxxxx> wrote:
> > >>
> > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > ><joel@xxxxxxxxxxxxxxxxx> wrote:
> > >> > >
> > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > >wrote:
> > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > >wrote:
> > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > ><christian@xxxxxxxxxx> wrote:
> > >> > > > > > So I dislike the idea of allocating new inodes from the
> > >procfs super
> > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > >concept exclusively
> > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > >through procfs
> > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > >and really
> > >> > > > > > wanted to have for a long time. So it makes sense to have
> > >this working.
> > >> > > > > > But it should really be useable without it. That's why
> > >translate_pid()
> > >> > > > > > and pidfd_clone() are on the table. What I'm saying is,
> > >once the pidfd
> > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > >- even
> > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > >is also a
> > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > >work.
> > >> > > > >
> > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > >pidfds. One
> > >> > > > > crazy idea that I was discussing with Joel the other day is
> > >to just
> > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > >get_procfs_root()
> > >> > > > > system call that returned, out of thin air and independent of
> > >the
> > >> > > > > mount table, a procfs root directory file descriptor for the
> > >caller's
> > >> > > > > PID namspace and suitable for use with openat(2).
> > >> > > >
> > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > >will not
> > >> > > > be happy about this. A syscall to get an fd to /proc?
> > >> >
> > >> > Why not? procfs provides access to a lot of core kernel
> > >functionality.
> > >> > Why should you need a mountpoint to get to it?
> > >> >
> > >> > > That's not going
> > >> > > > to happen and I don't see the need for a separate syscall just
> > >for that.
> > >> >
> > >> > We need a system call for the same reason we need a getrandom(2):
> > >you
> > >> > have to bootstrap somehow when you're in a minimal environment.
> > >> >
> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > >> >
> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > >> > proposing that we *hardwire* it as the default and just declare
> > >that
> > >> > it's not possible to build a Linux kernel that doesn't include
> > >procfs.
> > >> > Why do we even have that button?
> > >> >
> > >> > > I think his point here was that he wanted a handle to procfs no
> > >matter where
> > >> > > it was mounted and then can later use openat on that. Agreed that
> > >it may be
> > >> > > unnecessary unless there is a usecase for it, and especially if
> > >the /proc
> > >> > > directory being the defacto mountpoint for procfs is a universal
> > >convention.
> > >> >
> > >> > If it's a universal convention and, in practice, everyone needs
> > >proc
> > >> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
> > >If
> > >> > we advertise /proc as not merely some kind of optional debug
> > >interface
> > >> > but *the* way certain kernel features are exposed --- and there's
> > >> > nothing wrong with that --- then we should give programs access to
> > >> > these core kernel features in a way that doesn't depend on
> > >userspace
> > >> > kernel configuration, and you do that by either providing a
> > >> > procfs-root-getting system call or just hardwiring the "/proc/"
> > >prefix
> > >> > into VFS.
> > >> >
> > >> > > > Inode allocation from the procfs mount for the file descriptors
> > >Joel
> > >> > > > wants is not correct. Their not really procfs file descriptors
> > >so this
> > >> > > > is a nack. We can't just hook into proc that way.
> > >> > >
> > >> > > I was not particular about using procfs mount for the FDs but
> > >that's the only
> > >> > > way I knew how to do it until you pointed out anon_inode (my grep
> > >skills
> > >> > > missed that), so thank you!
> > >> > >
> > >> > > > > C'mon: /proc is used by everyone today and almost every
> > >program breaks
> > >> > > > > if it's not around. The string "/proc" is already de facto
> > >kernel ABI.
> > >> > > > > Let's just drop the pretense of /proc being optional and bake
> > >it into
> > >> > > > > the kernel proper, then give programs a way to get to /proc
> > >that isn't
> > >> > > > > tied to any particular mount configuration. This way, we
> > >don't need a
> > >> > > > > translate_pid(), since callers can just use procfs to do the
> > >same
> > >> > > > > thing. (That is, if I understand correctly what translate_pid
> > >does.)
> > >> > > >
> > >> > > > I'm not sure what you think translate_pid() is doing since
> > >you're not
> > >> > > > saying what you think it does.
> > >> > > > Examples from the old patchset:
> > >> > > > translate_pid(pid, ns, -1) - get pid in our pid namespace
> > >> >
> > >> > Ah, it's a bit different from what I had in mind. It's fair to want
> > >to
> > >> > translate PIDs between namespaces, but the only way to make the
> > >> > translate_pid under discussion robust is to have it accept and
> > >produce
> > >> > pidfds. (At that point, you might as well call it translate_pidfd.)
> > >We
> > >> > should not be adding new APIs to the kernel that accept numeric
> > >PIDs:
> > >>
> > >> The traditional pid-based api is not going away. There are users that
> > >> have the requirement to translate pids between namespaces and also
> > >doing
> > >> introspection on these namespaces independent of pidfds. We will not
> > >> restrict the usefulness of this syscall by making it only work with
> > >> pidfds.
> > >>
> > >> > it's not possible to use these APIs correctly except under very
> > >> > limited circumstances --- mostly, talking about init or a parent
> > >>
> > >> The pid-based api is one of the most widely used apis of the kernel
> > >and
> > >> people have been using it quite successfully for a long time. Yes,
> > >it's
> > >> rac, but it's here to stay.
> > >>
> > >> > talking about its child.
> > >> >
> > >> > Really, we need a few related operations, and we shouldn't
> > >necessarily
> > >> > mingle them.
> > >>
> > >> Yes, we've established that previously.
> > >>
> > >> >
> > >> > 1) Given a numeric PID, give me a pidfd: that works today: you just
> > >> > open /proc/<pid>
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 2) Given a pidfd, give me a numeric PID: that works today: you just
> > >> > openat(pidfd, "stat", O_RDONLY) and read the first token (which is
> > >> > always the numeric PID).
> > >>
> > >> Agreed.
> > >>
> > >> >
> > >> > 3) Given a pidfd, send a signal: that's what pidfd_send_signal
> > >does,
> > >> > and it's a good start on the rest of these operations.
> > >>
> > >> Agreed.
> > >>
> > >> > 5) Given a pidfd in NS1, get a pidfd in NS2. That's what
> > >translate_pid
> > >> > is for. My preferred signature for this routine is
> > >translate_pid(int
> > >> > pidfd, int nsfd) -> pidfd. We don't need two namespace arguments.
> > >Why
> > >> > not? Because the pidfd *already* names a single process, uniquely!
> > >>
> > >> Given that people are interested in pids we can't just always return
> > >a
> > >> pidfd. That would mean a user would need to do get the pidfd read
> > >from
> > >> <pidfd>/stat and then close the pidfd. If you do that for a 100 pids
> > >or
> > >> more you end up allocating and closing file descriptors constantly
> > >for
> > >> no reason. We can't just debate pids away. So it will also need to be
> > >> able to yield pids e.g. through a flag argument.
> > >
> > >Sure, but that's still not a reason that we should care about pidfds
> > >working separately from procfs..
>
> That's unrelated to the point made in the above paragraph.
> Please note, I said that the pidfd api should work when proc is not
> available not that they can't be dirfds.

What do you mean by "not available"? CONFIG_PROCFS=n? If pidfds
supposed to work when proc is unavailable yet also be directory FDs,
to what directory should the FD refer? As I mentioned in my previous
message, trying to make pidfd work without CONFIG_PROCFS is a very bad
idea.

>
> >
> > Agreed. I can't imagine pidfd being anything but a proc pid directory handle. So I am confused what Christian meant. Pidfd *is* a procfs directory fid always. That's what I gathered from his pidfd_send_signal patch but let me know if I'm way off in the woods.
>
> (K9 Mail still hasn't learned to wrap lines at 80 it seems. :))
>
> Again, I never said that pidfds should be a directory handle.
> (Though I would like to point out that one of the original ideas I
> discussed at LPC was to have something like this to get regular file
> descriptors instead of dirfds:
> https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df)

As I mentioned in my original email on this thread, if you have
regular file descriptors instead of directory FDs, you have to use
some special new API instead of openat to get metadata about a
process. That's pointless duplication of functionality considering
that a directory FD gives you that information automatically.

> > For my next revision, I am thinking of adding the flag argument Christian mentioned to make translate_pid return an anon_inode FD which can be used for death status, given a <pid>. Since it is thought that translate_pid can be made to return a pid FD, I think it is ok to have it return a pid status FD for the purposes of the death status as well.

> translate_pid() should just return you a pidfd. Having it return a pidfd
> and a status fd feels like stuffing too much functionality in there. If
> you're fine with it I'll finish prototyping what I had in mind. As I
> said in previous mails I'm already working on this.

translate_pid also needs to *accept* pidfds, at least optionally.
Unless you have a function from pidfd to pidfd, you race.