Re: For review: seccomp_user_notif(2) manual page

From: Tycho Andersen
Date: Thu Oct 01 2020 - 14:59:58 EST


On Thu, Oct 01, 2020 at 08:18:49PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen <tycho@tycho.pizza> wrote:
> > On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > > <christian.brauner@xxxxxxxxxxxxx> wrote:
> > > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > > <mtk.manpages@xxxxxxxxx> wrote:
> > > > > > NOTES
> > > > > > The file descriptor returned when seccomp(2) is employed with the
> > > > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > > > > poll(2), epoll(7), and select(2). When a notification is
> > > > > > > pending, these interfaces indicate that the file descriptor is
> > > > > > > readable.
> > > > >
> > > > > We should probably also point out somewhere that, as
> > > > > include/uapi/linux/seccomp.h says:
> > > > >
> > > > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > > > * same syscall, the most recently added filter takes precedence. This means
> > > > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > > > * such filtered syscalls to be executed by sending the response
> > > > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > > > > * be overridden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > > >
> > > > > In other words, from a security perspective, you must assume that the
> > > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > > calling seccomp(). This should also be noted over in the main
> > > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > > >
> > > > So I was actually wondering about this when I skimmed this a while
> > > > ago, but forgot about it again... Afaict, you can only ever load a
> > > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > > in the task's filter hierarchy then the kernel will refuse to load a new
> > > > one?
> > > >
> > > > static struct file *init_listener(struct seccomp_filter *filter)
> > > > {
> > > > 	struct file *ret = ERR_PTR(-EBUSY);
> > > > 	struct seccomp_filter *cur;
> > > >
> > > > 	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > > 		if (cur->notif)
> > > > 			goto out;
> > > > 	}
> > > >
> > > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > > override each other for the same task simply because there can only ever
> > > > be a single one?
> > >
> > > Good point. Exceeeept that that check seems ineffective because this
> > > happens before we take the locks that guard against TSYNC, and also
> > > before we decide to which existing filter we want to chain the new
> > > filter. So if two threads race with TSYNC, I think they'll be able to
> > > chain two filters with listeners together.
> >
> > Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> > be totally effective,
> >
> > > I don't know whether we want to eternalize this "only one listener
> > > across all the filters" restriction in the manpage though, or whether
> > > the man page should just say that the kernel currently doesn't support
> > > it but that security-wise you should assume that it might at some
> > > point.
> >
> > This requirement originally came from Andy, arguing that the semantics
> > of this were/are confusing, which still makes sense to me. Perhaps we
> > should do something like the below?
> [...]
> > +static bool has_listener_parent(struct seccomp_filter *child)
> > +{
> > +	struct seccomp_filter *cur;
> > +
> > +	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > +		if (cur->notif)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> [...]
> > @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
> [...]
> > + /* don't allow TSYNC to install multiple listeners */
> > + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> > + !has_listener_parent(thread->seccomp.filter))
> > + continue;
> [...]
> > @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
> > static struct file *init_listener(struct seccomp_filter *filter)
> [...]
> > -	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > -		if (cur->notif)
> > -			goto out;
> > -	}
> > +	if (has_listener_parent(current->seccomp.filter))
> > +		goto out;
>
> I dislike this because it combines a non-locked check and a locked
> check. And I don't think this will work in the case where TSYNC and
> non-TSYNC race - if the non-TSYNC call nests around the TSYNC filter
> installation, the thread that called seccomp in non-TSYNC mode will
> still end up with two notifying filters. How about the following?

Sure, you can add,

Reviewed-by: Tycho Andersen <tycho@tycho.pizza>

when you send it.

Tycho