Re: [RFC PATCH v2 4/5] sched: UMCG: add a blocked worker list
From: Peter Zijlstra
Date:  Mon Jan 17 2022 - 04:20:05 EST
On Thu, Jan 13, 2022 at 03:39:39PM -0800, Peter Oskolkov wrote:
> The original idea of a UMCG server was that it was used as a proxy
> for a CPU, so if a worker associated with the server is RUNNING,
> the server itself is never ever was allowed to be RUNNING as well;
> when umcg_wait() returned for a server, it meant that its worker
> became BLOCKED.
> 
> In the new (old?) "per server runqueues" model implemented in
> the previous patch in this patchset, servers are woken when
> a previously blocked worker on their runqueue finishes its blocking
> operation, even if the currently RUNNING worker continues running.
> 
> As now a server may run while a worker assigned to it is running,
> the original idea of having at most a single worker RUNNING per
> server, as a means to control the number of running workers, is
> not really enforced, and the server, woken by a worker
> doing BLOCKED=>RUNNABLE transition, may then call sys_umcg_wait()
> with a second/third/etc. worker to run.
> 
> Support this scenario by adding a blocked worker list:
> when a worker transitions RUNNING=>BLOCKED, not only its server
> is woken, but the worker is also added to the blocked worker list
> of its server.
> 
> This change introduces the following benefits:
> - block detection how behaves similarly to wake detection;
>   without this patch worker wakeups added wakees to the list
>   and woke the server, while worker blocks only woke the server
>   without adding blocked workers to a list, forcing servers
>   to explicitly check worker's state;
> - if the blocked worker woke sufficiently quickly, the server
>   woken on the block event would observe its worker now as
>   RUNNABLE, so the block event had to be inferred rather than
>   explicitly signalled by the worker being added to the blocked
>   worker list;
> - it is now possible for a single server to control several
>   RUNNING workers, which makes writing userspace schedulers
>   simpler for smaller processes that do not need to scale beyond
>   one "server";
> - if the userspace wants to keep at most a single RUNNING worker
>   per server, and have multiple servers with their own runqueues,
>   this model is also naturally supported here.
> 
> So this change basically decouples block/wake detection from
> M:N threading in the sense that the number of servers is now
> does not have to be M or N, but is more driven by the scalability
> needs of the userspace application.
So I don't object to having this blocking list, we had that early on in
the discussions.
*However*, combined with WF_CURRENT_CPU this 1:N userspace model doesn't
really make sense, also combined with Proxy-Exec (if we ever get that
sorted) it will fundamentally not work.
More consideration is needed I think...