Re: [PATCH 1/2] seccomp: notify user trap about unused filter

From: Christian Brauner
Date: Wed May 27 2020 - 18:45:06 EST


On Wed, May 27, 2020 at 03:37:58PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 12:05:32AM +0200, Christian Brauner wrote:
> > The main question also is, is there precedence where the kernel just
> > closes the file descriptor for userspace behind it's back? I'm not sure
> > I've heard of this before. That's not how that works afaict; it's also
> > not how we do pidfds. We don't just close the fd when the task
> > associated with it goes away, we notify and then userspace can close.
>
> But there's a mapping between pidfd and task struct that is separate
> from task struct itself, yes? I.e. keeping a pidfd open doesn't pin
> struct task in memory forever, right?

No, but that's an implementation detail and we discussed that. It pins
struct pid instead of task_struct. Once the process is fully gone you
just get ESRCH.
For example, fds to /proc/<pid>/<tid>/ fds aren't just closed once the
task has gone away, userspace will just get ESRCH when it tries to open
files under there but the fd remains valid until close() is called.

In addition, of all the anon inode fds, none of them have the "close the
file behind userspace back" behavior: io_uring, signalfd, timerfd, btf,
perf_event, bpf-prog, bpf-link, bpf-map, pidfd, userfaultfd, fanotify,
inotify, eventpoll, fscontext, eventfd. These are just core kernel ones.
I'm pretty sure that it'd be very odd behavior if we did that. I'd
rather just notify userspace and leave the close to them. But maybe I'm
missing something.

Christian