Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls

From: Peter Oskolkov
Date: Mon Nov 29 2021 - 18:39:52 EST


On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
[...]
> > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > >
> > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > supposed to do.
[...]
>
> So then A does:
>
> A::next_tid = C.tid;
> sys_umcg_wait();
>
> Which will:
>
> pin(A);
> pin(S0);
>
> cmpxchg(A::state, RUNNING, RUNNABLE);

Hmm.... That's another difference between your patch and mine: my
approach was "the side that initiates the change updates the state".
So in my code the userspace changes the current task's state RUNNING
=> RUNNABLE and the next task's state, or the server's state, RUNNABLE
=> RUNNING before calling sys_umcg_wait(). The kernel changes worker
states to BLOCKED/RUNNABLE during block/wake detection, and marks
servers RUNNING when waking them during block/wake detection; but all
applicable state changes for sys_umcg_wait() happen in the userspace.

The reasoning behind this approach was:
- do in kernel only that which cannot be done in the userspace, to
  make the kernel code smaller/simpler
- similar to how futexes work: futex_wait does not change the futex
  value to the desired value, but just checks whether the futex value
  matches the desired value
- similar to how futexes work, concurrent state changes can happen in
  the userspace without calling into the kernel at all
  for example:
    - (a): worker A goes to sleep into sys_umcg_wait()
    - (b): worker B wants to context switch into worker A "a moment" later
    - due to preemption/interrupts/pagefaults/whatnot, (b) happens
      in reality before (a)
  in my patchset, the situation above happily resolves in the
  userspace so that worker A keeps running without ever calling
  sys_umcg_wait().
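
To make the (a)/(b) race concrete, here is a rough userspace-only sketch
using C11 atomics. The state names mirror the ones used in this thread,
but struct task and the three helpers are made up for illustration; this
is not the UMCG API, and the patchset itself may structure this
differently. The point is only that the sleeper re-checks its own state
before entering the kernel, the same way a futex waiter checks the futex
value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state values; only the names follow the thread. */
enum task_state { TASK_RUNNING, TASK_RUNNABLE, TASK_BLOCKED };

/* Hypothetical per-task struct shared between userspace threads. */
struct task {
	_Atomic int state;
};

/* (a) step 1: the sleeper announces itself, RUNNING => RUNNABLE. */
static bool begin_sleep(struct task *self)
{
	int expected = TASK_RUNNING;

	return atomic_compare_exchange_strong(&self->state, &expected,
					      TASK_RUNNABLE);
}

/*
 * (b): the waker flips the target RUNNABLE => RUNNING, entirely in
 * userspace. Fails (returns false) if the target is not RUNNABLE.
 */
static bool wake_in_userspace(struct task *t)
{
	int expected = TASK_RUNNABLE;

	return atomic_compare_exchange_strong(&t->state, &expected,
					      TASK_RUNNING);
}

/*
 * (a) step 2: just before the syscall, the sleeper re-checks its state.
 * If a waker already flipped it back to RUNNING, the syscall is skipped
 * and the sleeper simply keeps running.
 */
static bool must_call_wait(struct task *self)
{
	return atomic_load(&self->state) == TASK_RUNNABLE;
}
```

If B's wake_in_userspace(A) lands between A's begin_sleep() and A's
must_call_wait() check, A sees RUNNING and never enters the kernel.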

Again, I don't think this is deal breaking, and your approach will
work, just a bit less efficiently in some cases :)

I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
A transfers its server to worker B that A intends to context switch
into, and then worker A pagefaults or gets interrupted before calling
sys_umcg_wait()? The server will be woken up and will see that it is
assigned to worker B; now what? If worker A is "locked" before the
whole thing starts, the pagefault/interrupt will not trigger
block/wake detection, worker A will keep RUNNING for all intents and
purposes, and eventually will call sys_umcg_wait() as it had
intended...
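
Roughly, the "locked" idea above could be modeled like this. Everything
here is illustrative: TF_LOCKED is a placeholder bit (not necessarily
the value UMCG_TF_LOCKED had in my patchset), and struct worker and the
helpers are hypothetical names, not kernel or UMCG code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bit, assumed for this sketch only. */
#define TF_LOCKED (1u << 8)

/* Hypothetical worker with a single state word carrying flag bits. */
struct worker {
	_Atomic uint32_t state;
};

/* Userspace sets the flag before starting the server handoff. */
static void worker_lock(struct worker *w)
{
	atomic_fetch_or(&w->state, TF_LOCKED);
}

/* Cleared once the handoff is done. */
static void worker_unlock(struct worker *w)
{
	atomic_fetch_and(&w->state, ~TF_LOCKED);
}

/*
 * What block/wake detection would consult on a pagefault/interrupt:
 * a locked worker is left alone, so its server is not woken.
 */
static bool detection_applies(struct worker *w)
{
	return !(atomic_load(&w->state) & TF_LOCKED);
}
```

So worker A would call worker_lock() first, transfer its server to B,
and only then go into sys_umcg_wait(); a pagefault in between would see
detection_applies() == false and not wake the server.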

[...]