SECCOMP_RET_USER_NOTIF: listener improvements

From: Christian Brauner
Date: Wed Apr 24 2019 - 11:04:32 EST


Hey everyone,

So I was working on making use of the seccomp listener stuff and I
stumbled upon a problem. Imagine a scenario where:

1. Task T1 installs Filter F1 and gets a listener fd FD1 for that filter
2. T1 sends FD1 via SCM_RIGHTS to task T2
T2 now holds a reference to the same underlying struct file as FD1 via FD2
3. T2 registers FD2 in an event loop and starts listening for events
4. T1 exits, which closes FD1

Now, T2 still holds a reference to the filter via FD2, which references
the same underlying file as FD1, and that file has the seccomp filter
stashed in private_data. So T2 will never get notified that the filter
is essentially unused and doesn't know when to exit, i.e. it has no way
of telling when T1 and all of its children using the same filter are
gone.
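
To make steps 1-4 concrete, here is a minimal userspace sketch of the fd
handoff. It is only an illustration under stated assumptions: a plain
pipe stands in for the seccomp listener fd so that it runs without
seccomp support, and the helper names send_fd(), recv_fd() and
demo_fd_outlives_sender() are made up for this example. It shows that
the fd T2 received keeps the underlying file alive after T1 has exited.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send one fd over a Unix socket via SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent with send_fd(). */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Returns 0 if FD2 is still usable in T2 after T1 has exited. */
int demo_fd_outlives_sender(void)
{
	int sv[2], fd2, status;
	ssize_t n = -1;
	char c = 0;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* T1: create "FD1" (pipe stand-in), send it, then exit. */
		int p[2];

		close(sv[0]);
		if (pipe(p) < 0 || write(p[1], "x", 1) != 1)
			_exit(1);
		_exit(send_fd(sv[1], p[0]) < 0 ? 1 : 0);
	}

	/* T2: receive FD2, wait until T1 is gone, then use FD2. */
	close(sv[1]);
	fd2 = recv_fd(sv[0]);
	waitpid(pid, &status, 0);
	if (fd2 >= 0) {
		n = read(fd2, &c, 1);
		close(fd2);
	}
	close(sv[0]);
	return (n == 1 && c == 'x') ? 0 : -1;
}
```

Calling demo_fd_outlives_sender() from main() returns 0 when T2 can still
read the byte T1 wrote before exiting. Note the difference to the
listener case: a pipe reports EOF once the last writer is gone, whereas
T2 gets no comparable signal on the listener fd in the scenario above.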

I think we should have a way to do this *or* alternatively have a way to
attach a process to an existing filter.

The scenario described above arises pretty naturally on container
attach. The usual way of doing this is
fork() + attach_namespaces() + clone(CLONE_PARENT), where you don't
share the filter of the container's init. So the seccomp context has to
be recreated. [1]

Opinions?

Christian

[1]: Note that systemd-nspawn is creating the process by talking
to the container's systemd and requesting that it run the programs via
transient units, but this only works if systemd is run inside the
container and if you trust the workload.
