We are not going to go back to the wild balancing that
numasched does (I have some benchmarks where sched-domains
reduces cross node task movement by several orders of
magnitude).
Agreed, I think that'd be a fatal mistake ...
So the other option is to do balance on clone
across NUMA nodes, and make it very sensitive to imbalance.
Or probably better to make it easy to balance off to an idle
CPU, but much more difficult to balance off to a busy CPU.
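
To make the idle-vs-busy asymmetry concrete, here is a toy model in Python. It is not scheduler code: the `Cpu` type, both thresholds, and `pick_clone_cpu` are hypothetical names and values chosen only for illustration. An idle CPU qualifies on a tiny imbalance, while a busy CPU needs a much larger one.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    node: int
    nr_running: int  # runnable tasks currently on this CPU


# Hypothetical thresholds: the load difference needed before a new
# task may be pushed away from the CPU that called clone().
IDLE_IMBALANCE = 1  # easy to move to an idle CPU
BUSY_IMBALANCE = 4  # much harder to move to a busy CPU


def _rank(cpu, this_cpu):
    # Prefer fewer runnable tasks, then a CPU on the caller's own node.
    return (cpu.nr_running, cpu.node != this_cpu.node)


def pick_clone_cpu(cpus, this_cpu):
    """Return the CPU a newly cloned task should start on."""
    best = this_cpu
    for cpu in cpus:
        if cpu.cpu_id == this_cpu.cpu_id:
            continue
        required = IDLE_IMBALANCE if cpu.nr_running == 0 else BUSY_IMBALANCE
        if this_cpu.nr_running - cpu.nr_running < required:
            continue  # not imbalanced enough to justify moving
        if best is this_cpu or _rank(cpu, this_cpu) < _rank(best, this_cpu):
            best = cpu
    return best
```

With these made-up numbers, a parent alone on its CPU still sends its child to an idle CPU on another node, but a CPU with three tasks will not push a child to a remote CPU that already runs one.
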
I think that's correct, but we need to be careful. We really,
really do want to try to keep threads on the same node *if* we
have enough processes around to keep the machine busy. Because
we don't balance on fork, we do a reasonable job of that today,
but we should probably be more reluctant on rebalance than we are.
It's when we have fewer processes than nodes that we want to
spread things around. That's a difficult balance to strike (and
exactly why I wimped out on it originally ;-)).
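
The trade-off can be stated as a small rule. The Python sketch below is only an illustration of that rule under simplifying assumptions: `place_thread` and its per-node task counts are hypothetical, and a real policy would need to look at CPUs per node, not just node counts.

```python
def place_thread(parent_node, tasks_per_node):
    """Pick a node for a new thread.

    tasks_per_node[n] is the number of runnable tasks on node n
    before the new thread is added (a hypothetical input).
    """
    nr_nodes = len(tasks_per_node)
    total = sum(tasks_per_node)

    if total < nr_nodes:
        # Fewer tasks than nodes: spread out, taking the emptiest node
        # (lowest node number wins ties).
        return min(range(nr_nodes), key=lambda n: (tasks_per_node[n], n))

    # Enough work to keep the machine busy: stay next to the parent.
    return parent_node
```

For example, `place_thread(0, [1, 0, 0, 0])` moves the new thread to node 1, while `place_thread(0, [4, 3, 5, 4])` keeps it on node 0.
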