Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier

From: Sargun Dhillon
Date: Fri Apr 29 2022 - 13:14:45 EST


On Fri, Apr 29, 2022 at 11:42:15AM +0200, Rodrigo Campos wrote:
> On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > the concept is searchable. If the notifying process is signaled prior
> > to the notification being received by the userspace agent, it will
> > be handled as normal.
>
> Why is that? Why not always handle in the same way (if wait killable
> is set, wait like that)
>

The goal is to avoid two things:
1. Unncessary work - Often times, we see workloads that implement techniques
like hedging (Also known as request racing[1]). In fact, RFC3484
(destination address selection) gets implemented where the DNS library
will connect to many backend addresses and whichever one comes back first
"wins".
2. Side effects - We don't want a situation where a syscall is in progress
that is non-trivial to rollback (mount), and from user space's perspective
this syscall never completed.

Blocking before the syscall even starts is excessive. When we looked at this
we found that with runtimes like Golang, they can get into a bad situation
if they have many (1000s) of threads that are in the middle of a syscall
because all of them need to elide prior to GC. In this case the runtime
prioritizes the liveness of GC vs. the syscalls.

That being said, there may be some syscalls in a filter that need the suggested
behaviour. I can imagine introducing a new flag
(say SECCOMP_FILTER_FLAG_WAIT_KILLABLE) that applies to all states.
Alternatively, in one implementation, I put the behaviour in the data
field of the return from the BPF filter.


> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index db10e73d06e0..9291b0843cb2 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
> > complete(&addfd->completion);
> > }
> >
> > +static bool should_sleep_killable(struct seccomp_filter *match,
> > + struct seccomp_knotif *n)
> > +{
> > + return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
>
> Here for some reason we check the notification state to be SENT.
>
Because we don't want to block unless the notification has been received
by userspace.

> > +}
> > +
> > static int seccomp_do_user_notification(int this_syscall,
> > struct seccomp_filter *match,
> > const struct seccomp_data *sd)
> > @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
> > * This is where we wait for a reply from userspace.
> > */
> > do {
> > + bool wait_killable = should_sleep_killable(match, &n);
> > +
>
> So here, the first time this runs this will be false even if the
> wait_killable flag was used in the filter (because that function
> checks the notification state to be sent, that is not true the first
> time)
>
> Why not just do wait_for_completion_killable if match->wait_killable
> and wait_for_completion_interruptible otherwise? Am I missing
> something?
Again, this is to allow for the notification to be able to be preempted
prior to being received by the supervisor.

>
>
>
> Best,
> Rodrigo
[1]: https://research.google/pubs/pub40801/