Re: [patch v4 1/8] add basic task isolation prctl interface
From: Marcelo Tosatti
Date: Thu Oct 14 2021 - 09:42:20 EST
<snip>
> What are the requirements of the signal exactly (and why it is popular) ?
> Because the interruption event can be due to:
>
> * An IPI.
> * A system call.
IRQs (which are easy to trace) and exceptions.
> In the "full task isolation mode" patchset (the one from Alex), a system call
> will automatically generate a SIGKILL once a system call is performed
> (after the prctl to enable task isolated mode, but
> before the prctl to disable task isolated mode).
> This can be implemented, if desired, by SECCOMP syscall blocking
> (which already exists).
>
> For other interruptions, which happen through IPIs, one can print
> the stack trace of the program (or interrupt) that generated
> the IPI to find out the cause (which is what rt-trace-bpf.py is doing).
>
> An alternative would be to add tracepoints so that one can
> find out which function in the kernel caused the CPU and
> task to become "a target for interruptions".
For example, adding a tracepoint to the mark_vmstat_dirty() function
(allowing one to see how that function was invoked on a given CPU,
and by whom) appears to provide sufficient information to debug problems,
instead of a coredump produced by a SIGKILL sent at that point.
(mark_vmstat_dirty() is from
[patch v4 5/8] task isolation: sync vmstats conditional on changes)
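For illustration only, such a tracepoint could look roughly like this (kernel-side sketch; the event name and fields are made up and not part of the posted series):

```c
/* Hypothetical include/trace/events/task_isol.h fragment.
 * mark_vmstat_dirty() would call
 * trace_task_isol_vmstat_dirty(smp_processor_id()). */
#include <linux/tracepoint.h>

TRACE_EVENT(task_isol_vmstat_dirty,

	TP_PROTO(int cpu),

	TP_ARGS(cpu),

	TP_STRUCT__entry(
		__array(char, comm, TASK_COMM_LEN)
		__field(int, cpu)
	),

	TP_fast_assign(
		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
		__entry->cpu = cpu;
	),

	TP_printk("comm=%s marked vmstat dirty on cpu=%d",
		  __entry->comm, __entry->cpu)
);
```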
Looking at
https://github.com/abelits/libtmc
one can see that notification via SIGUSR1 is used.
To support something similar, one would add a new bit to the
flags field of:
+struct task_isol_activate_control {
+ __u64 flags;
+ __u64 quiesce_oneshot_mask;
+ __u64 pad[6];
+};
Remove
+ ret = -EINVAL;
+ if (act_ctrl.flags)
+ goto out;
from the handler, shrink the padded space, and use it.
>
> > > > Also, see:
> > > >
> > > > https://lkml.kernel.org/r/20210929152429.186930629@xxxxxxxxxxxxx
> > >
> > > As you can see from the below pseudocode, we were thinking of queueing
> > > the (invalidate icache or TLB flush) in case app is in userspace,
> > > to perform on return to kernel space, but the approach in your patch might be
> > > superior (will take sometime to parse that thread...).
> >
> > Let me assume you're talking about kernel TLB invalidates, otherwise it
> > would be terribly broken.
> >
> > > > Suppose:
> > > >
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
> > > >
> > > > $important userspace arch_send_call_function_ipi_mask()
> > > > <IPI>
> > > > // finds task is 'important' and
> > > > // can't take interrupts
> > > > sigkill()
> > > >
> > > > *Whoopsie*
> > > >
> > > >
> > > > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > > > therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> > > > to userspace, it'll get killed on interruption. Catch-22.
To reiterate this point:
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
1) Such races can be fixed by proper use of atomic variables.
2) If a signal to an application is desired, I fail to see why this
interface (ignoring bugs in the particular mechanism) would not
allow it.
So hopefully this addresses your comments.