Re: [Lse-tech] Re: [patch] scheduler fix for 1cpu/node case

From: Andrew Theurer (habanero@us.ibm.com)
Date: Tue Jul 29 2003 - 08:33:05 EST


On Tuesday 29 July 2003 05:08, Erich Focht wrote:
> On Tuesday 29 July 2003 04:24, Andrew Theurer wrote:
> > On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > > > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > > > should be flexible enough to deal with it. And: the corner case of 1
> > > > CPU per node is possible also on any other NUMA platform, when in
> > > > some of the nodes (or even just one) only one CPU is configured in.
> > > > Solving that problem automatically gives the Hammer what it needs.
> >
> > I am going to ask a silly question, do we have any data showing this
> > really is a problem on AMD? I would think, even if we have an idle cpu,
> > sometimes a little delay on task migration (on NUMA) may not be a bad
> > thing. If it is too long, can we just make the rebalance ticks arch
> > specific?
>
> The fact that global rebalances are done only in the timer interrupt
> is simply bad!

Even with this patch, most balances still seem to be timer based, because we
still call load_balance in rebalance_tick. Granted, we may inter-node balance
more often (well, maybe less often, since node_busy_rebalance_tick was
busy_rebalance_tick*2). I do see the advantage of doing this at idle, but at
idle only; that's why I'd be more inclined toward only a much more aggressive
idle rebalance.

> It complicates rebalance_tick() and wastes the
> opportunity to get feedback from the failed local balance attempts.

What does "failed" really mean? To me, when *busiest=null, that means we
passed: the node itself is probably balanced, and there's nothing to do. It
gives no indication at all of the global load [im]balance. Shouldn't the
thing we are looking for be the imbalance among node_nr_running[]? Would it
make sense to go forward with a global balance based on that only?
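
To make that question concrete, here is a rough sketch of choosing a node to
pull from using nothing but per-node run counts. It is an illustration only,
not code from the patch or the kernel: the name busiest_node_sketch, the plain
int arrays, the x10 load scaling and the 25% threshold are all assumptions.

```c
#include <stddef.h>

/* Hypothetical helper, not kernel code: pick the node to pull from using
 * only per-node run counts. Returns -1 when no remote node is loaded
 * enough to justify a cross-node move. */
int busiest_node_sketch(const int *node_nr_running, const int *node_nr_cpus,
                        size_t nr_nodes, size_t this_node)
{
    int this_load, max_load;
    int busiest = -1;
    size_t i;

    if (this_node >= nr_nodes || node_nr_cpus[this_node] <= 0)
        return -1;

    /* Scale by 10 so integer division keeps one decimal of load per cpu. */
    this_load = 10 * node_nr_running[this_node] / node_nr_cpus[this_node];
    max_load = this_load;

    for (i = 0; i < nr_nodes; i++) {
        int load;

        if (i == this_node || node_nr_cpus[i] <= 0)
            continue;
        load = 10 * node_nr_running[i] / node_nr_cpus[i];
        /* Assumed threshold: the remote node must be >25% busier than us. */
        if (load > max_load && 100 * load > 125 * this_load) {
            max_load = load;
            busiest = (int)i;
        }
    }
    return busiest;
}
```

With run counts {1, 4} on two 1-cpu nodes, node 0 would pick node 1; with
{4, 5} nothing crosses the assumed threshold and the sketch returns -1. Note
that the result does not depend on whether a local balance attempt "failed".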

> If you want data supporting my assumptions: Ted Ts'o's talk at OLS
> shows the necessity to rebalance ASAP (even in try_to_wake_up).

If this is the patch I am thinking of, it was the (attached) one I sent them,
which did a light "push" rebalance at try_to_wake_up. Calling load_balance
at try_to_wake_up seems very heavy-weight. This patch only looks for an idle
cpu (within the same node) to wake up on before task activation, and only if
task_rq(p)->nr_running is too high. So, yes, I do believe this can be
important, but I think it's only called for when we have an idle cpu.
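
The idea can be sketched roughly as below. This is not the attached patch,
just a standalone illustration: struct cpu_sketch, wake_cpu_sketch and the
max_queue parameter are made-up names, and "idle" is assumed to mean
nr_running == 0.

```c
#include <stddef.h>

/* Hypothetical per-cpu view; field names are illustrative, not kernel ones. */
struct cpu_sketch {
    int node;        /* NUMA node this cpu belongs to */
    int nr_running;  /* runnable tasks queued on this cpu */
};

/* Return the cpu the waking task should be activated on. The task stays on
 * task_cpu unless that queue holds more than max_queue tasks AND an idle
 * cpu exists in the same node. No queues are scanned for load otherwise. */
size_t wake_cpu_sketch(const struct cpu_sketch *cpus, size_t nr_cpus,
                       size_t task_cpu, int max_queue)
{
    size_t i;

    if (task_cpu >= nr_cpus || cpus[task_cpu].nr_running <= max_queue)
        return task_cpu;          /* queue is short enough: do nothing */

    for (i = 0; i < nr_cpus; i++) {
        if (i != task_cpu &&
            cpus[i].node == cpus[task_cpu].node &&
            cpus[i].nr_running == 0)
            return i;             /* idle cpu in the same node */
    }
    return task_cpu;              /* no idle local cpu: leave the task alone */
}
```

The point of the sketch is the cost: one comparison in the common case and a
scan for an idle cpu in the rare one, never a full load_balance pass.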

> There
> are plenty of arguments towards this, starting with the steal delay
> parameter scans from the early days of multi-queue schedulers (Davide
> Libenzi), over my experiments with NUMA schedulers and the observation
> of Andi Kleen that on Opteron you better run from the wrong CPU than
> wait (if the task returns to the right cpu, all's fine anyway).
>
> > I'd much rather have info related to the properties of the NUMA arch than
> > something that makes decisions based on nr_cpus_node(). For example, we
> > may want to inter-node balance as much or more often on ppc64 than even
> > AMD, but it has 8 cpus per node. With this patch it would have the lowest
> > inter-node balance frequency, even though it probably has one of the
> > lowest latencies between nodes and highest throughput interconnects.
>
> We can still discuss the formula. Currently there's a bug in the
> scheduler and the corner case of 1 cpu/node is just broken. The
> function local_balance_retries(attempts, cpus_in_this_node) must
> return 0 for cpus_in_this_node=1 and should return larger values for
> larger cpus_in_this_node. To have an upper limit is fine, but maybe
> not necessary.
>
> Regarding the ppc64 interconnect, I'm glad that you said "probably"
> and "one of the ...". No need to point you to better ones ;-)

OK, we won't get into a pissing match :) I just wanted to base the scheduler
decisions on data related to the hardware NUMA properties, not the cpu count.
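
For reference, the constraint quoted above (0 retries for a 1-cpu node, more
for larger nodes, optionally capped) can be met by something as small as the
sketch below. It is an illustration only and not the function from the patch:
it ignores the "attempts" argument, and the name local_retries_sketch, the
"cpus - 1" formula and the cap parameter are assumptions.

```c
/* Hypothetical sketch, not the function from the patch: how many failed
 * local balance attempts to tolerate before looking at other nodes. */
int local_retries_sketch(int cpus_in_this_node, int cap)
{
    int retries;

    if (cpus_in_this_node <= 1)
        return 0;                 /* nothing to balance against locally */

    retries = cpus_in_this_node - 1;
    if (cap > 0 && retries > cap)
        retries = cap;            /* optional upper limit; cap <= 0 disables it */
    return retries;
}
```

Any formula of this shape depends only on the cpu count, so an 8-cpu node
always waits longest before an inter-node balance, whatever its interconnect
looks like.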

> > > Right, I realise that the 1 cpu per node case is broken. However,
> > > doesn't this also affect the multi-cpu per node case in the manner I
> > > suggested above? So even if we turn off NUMA scheduler for Hammer, this
> > > still needs fixing, right?
> >
> > Maybe so, but if we start making idle rebalance more aggressive, I think
> > we would need to make CAN_MIGRATE more restrictive, taking memory
> > placement of the tasks into account. On AMD with interleaved memory
> > allocation, tasks would move very easily, since their memory is spread
> > out anyway. On "home node" or node-local policy, we may not move a task
> > (or maybe not on the first attempt), even if there is an idle cpu in
> > another node.
>
> Aehm, that's another story and I'm sure we will fix that too. There
> are a few ideas around. But you shouldn't expect to solve all problems
> at once, after all optimal NUMA scheduling can still be considered a
> research area.
>
> > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > being a single node (no policy difference from non-numa sched), allowing
> > us to remove all NUMA ifdefs. I think the code would be much more
> > readable.
>
> :-) Then you expect that everybody who reads the scheduler code knows
> what NUMA stands for and what it means? Pretty optimistic, but yes,
> I'd like that, too.

Yes, at some point we have to. We cannot have two different schedulers. A
non-NUMA system should have the exact same scheduling policy as a NUMA system
with one node. I don't know if that's acceptable for 2.6, but I really want to
go for that in 2.7.

-Andrew Theurer





