Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

From: Michael Holzheu
Date: Fri Aug 19 2016 - 05:52:28 EST


On Thu, 18 Aug 2016 10:42:08 -0400,
Tejun Heo <tj@xxxxxxxxxx> wrote:

> Hello, Michael.
>
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> >
> > Example:
> >
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> > assignment:
> >
> > - N1: cpu 1 on chip 1
> > - N2: cpu 2 on chip 2
> >
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> >
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
>
> I see. Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns. Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found that on
big (> 1 TB) overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:

- Zone locks are highly contended because ZONE_NORMAL is big:
  * zone->lock
  * zone->lru_lock
- One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To put it in my own words:

- At boot time we already pin all remaining "not configured" logical
  CPUs to nodes. So all possible CPUs are pinned to nodes and
  cpu_to_node() will work.

- If a new physical CPU gets configured, we get the CPU topology
  information from the system and find the best node.

- We get a logical CPU number from that node's pool and assign the
  new physical CPU to that number.

If that works we would be as good as before. We will look at the code
to see whether this is possible.
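
A rough user-space sketch of the three steps above, not the actual s390
code. All names (pin_all_cpus, pick_node, configure_cpu, the NR_*
constants) are hypothetical, and the round-robin pinning at boot is only
an assumption for illustration. The node choice follows the rule from
the earlier example: prefer the node that already hosts a CPU on the
same chip, unless that node is ahead of the least loaded node.

```c
#include <assert.h>

#define NR_NODES 2    /* hypothetical number of fake NUMA nodes */
#define NR_CPUS  8    /* hypothetical number of possible logical CPUs */
#define NO_CPU   (-1)

static int cpu_node[NR_CPUS];     /* logical CPU -> node, fixed at boot */
static int cpu_chip[NR_CPUS];     /* chip of the attached physical CPU, -1 if free */
static int node_count[NR_NODES];  /* configured CPUs per node */

/* Step 1: pin every possible logical CPU to a node (assumed round-robin). */
void pin_all_cpus(void)
{
	int cpu, node;

	for (node = 0; node < NR_NODES; node++)
		node_count[node] = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		cpu_node[cpu] = cpu % NR_NODES;
		cpu_chip[cpu] = -1;
	}
}

/* Step 2: find the best node for a physical CPU living on 'chip'. */
int pick_node(int chip)
{
	int cpu, node, least = 0;

	for (node = 1; node < NR_NODES; node++)
		if (node_count[node] < node_count[least])
			least = node;

	/* Prefer a node with a CPU on the same chip, but only if balanced. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		node = cpu_node[cpu];
		if (cpu_chip[cpu] == chip &&
		    node_count[node] <= node_count[least])
			return node;
	}
	return least;
}

/* Step 3: take a free logical CPU number from the chosen node's pool. */
int configure_cpu(int chip)
{
	int cpu, node;

	assert(chip >= 0);
	node = pick_node(chip);
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_node[cpu] == node && cpu_chip[cpu] < 0) {
			cpu_chip[cpu] = chip;
			node_count[node]++;
			return cpu;
		}
	}
	return NO_CPU;	/* the pool of the chosen node is exhausted */
}
```

Calling pin_all_cpus() and then configure_cpu(1), configure_cpu(2)
gives one CPU per node. A further configure_cpu(2) lands on the same
node as the first chip 2 CPU, and one more configure_cpu(2) goes to the
other node because the chip 2 node is now one CPU ahead. With this
scheme cpu_to_node() never changes for a logical CPU number after boot.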

Michael