Re: [patch 2/2] sched: Scale the nohz_tracker logic by making it per NUMA node

From: Pallipadi, Venkatesh
Date: Mon Dec 14 2009 - 20:00:24 EST


On Mon, 2009-12-14 at 14:58 -0800, Peter Zijlstra wrote:
> On Mon, 2009-12-14 at 14:32 -0800, Pallipadi, Venkatesh wrote:
> >
> > The idea is to do idle balance only within the nodes.
> > Eg: 4 node (and 4 socket) system with each socket having 4 cores.
> > If there is a single active thread on such a system, say on socket 3.
> > Without this change we have 1 idle load balancer (which may be in socket
> > 0) which has periodic ticks and remaining 14 cores will be tickless.
> > But this one idle load balancer does load balance on behalf of itself +
> > 14 other idle cores.
> >
> > With the change proposed in this patch, we will have 3 completely idle
> > nodes/sockets. We will not do load balance on these cores at all.
>
> That seems like a behavioural change, not balancing these 3 nodes at all
> could lead to overload scenarios on the one active node, right?
>

Yes, you are right. This can result in some node-level imbalance. The
main problem we were trying to solve is the over-aggressive attempt to
load balance idle CPUs. On a system with 64 logical CPUs and only one
active thread, we have seen one other CPU (the idle load balancer)
spending 3-5% of its time non-idle, just doing load balance on behalf
of 63 idle CPUs on a continuous basis. It tries idle rebalance every
jiffy across all nodes, even though balancing across nodes has an
interval of 8 or 16 jiffies. Other forms of rebalancing, like fork and
exec balancing, will still balance across nodes. But if there are no
forks/execs, we will have the overload scenario you pointed out.

I guess we need to look at other alternatives to make this cross-node
idle load balancing more intelligent. However, the first patch in this
series has its share of advantages in avoiding unneeded idle balancing.
And with the first patch, cross-node issues will be no worse than the
current state. So it is worthwhile as a standalone change as well.

> > Remaining one active socket will have one idle load balancer, which when
> > needed will do idle load balancing on behalf of itself + 2 other idle
> > cores in that socket.
>
> > If all sockets have at least one busy core, then we may have more
> > than one idle load balancer, but each will only do idle load balance on
> > behalf of idle processors in its own node, so total idle load balance
> > will be same as now.
>
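
To put rough numbers on the 4-socket example quoted above, here is a
small user-space sketch in Python. It only models the counting and is
not kernel code; the function names and the busy-cores-per-node list
are made up for illustration.

```python
# User-space model only, not kernel code. All names are hypothetical.
NODES = 4
CORES_PER_NODE = 4


def global_ilb_coverage(busy_per_node):
    """Idle CPUs covered by the single system-wide idle load balancer,
    counting the balancer itself."""
    return sum(CORES_PER_NODE - busy for busy in busy_per_node)


def per_node_ilb_coverage(busy_per_node):
    """Idle CPUs covered when every node with at least one busy core has
    its own idle load balancer and completely idle nodes are skipped."""
    return sum(CORES_PER_NODE - busy for busy in busy_per_node if busy > 0)


# Single active thread on socket 3, everything else idle.
one_thread = [0, 0, 0, 1]
print(global_ilb_coverage(one_thread))    # 15: the balancer + 14 others
print(per_node_ilb_coverage(one_thread))  # 3: the balancer + 2 others

# At least one busy core on every socket: total work is unchanged.
all_nodes_busy = [1, 1, 1, 1]
print(global_ilb_coverage(all_nodes_busy))    # 12
print(per_node_ilb_coverage(all_nodes_busy))  # 12
```

The 12 idle CPUs on the three skipped nodes in the first case are where
the saving comes from, and also why nothing would pull load onto those
nodes through idle balancing.
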
> How about things like Magny-Cours which will have multiple nodes per
> socket, wouldn't that be best served by having the total socket idle,
> instead of just half of it?
>

Yes. But that will be the same for general load balancing behavior, not
just idle load balancing. That would probably need another level in the
scheduler domains?

Thanks,
Venki
