Re: 2.6.4-mm1

From: Nick Piggin
Date: Fri Mar 12 2004 - 09:40:07 EST

Andi Kleen wrote:

On Fri, Mar 12, 2004 at 03:24:43PM +1100, Nick Piggin wrote:

Andi Kleen wrote:

On Thu, Mar 11, 2004 at 07:04:50PM -0800, Nakajima, Jun wrote:

As we can have more complex architectures in the future, the scheduler
is flexible enough to represent various scheduling domains effectively,
and yet keeps the common scheduler code simple.

I think for SMT alone it's too complex and for NUMA it doesn't do
the right thing for "modern NUMAs" (where NUMA factor is very low
and you have a small number of CPUs for each node).

For SMT it is less complex than shared runqueues: it is actually
fewer lines of code and a smaller object size.

By moving all the complexity into arch/* ?

Well you have a point in a way. At least it is configurable, per
arch, and done in setup __init code. The whole point really was
to move the complexity to arch/* (or they can just use the default
setup, obviously).

It is also more flexible than shared runqueues in that you can still
have control over each sibling's runqueue. Con's SMT nice patch for
example would probably be more difficult to do with shared runqueues.
Shared runqueues also give zero affinity to siblings. While current
implementations may not (do they?) care, future ones might.

For Opteron type NUMA, it actually balances much more aggressively
than the default NUMA scheduler, especially when a CPU is idle. I
don't doubt that you are seeing poor performance, but that should
be fixable.

The problem is just presumably your lack of time to investigate
further, and my lack of problem descriptions or Opterons.

I didn't investigate further on your scheduler because I have my
doubts about it being the right approach and it seems to have
some obvious design bugs (like the racy SMT setup).

If you have any ideas about other approaches I would be interested
to hear them...

Setup needs some work, yes. It isn't a fundamental problem.

The problem description is still the same as it was in the past.

Basically it is: schedule as on SMP, but avoid local affinity for newly
created tasks and balance early. Allow all old-style NUMA heuristics
to be disabled.

That is pretty much what it does now. Apart from moving newly created
tasks. I think you're pretty brave for wanting to move new *threads*
off node. If anything, they are the tasks most likely to
share memory. But I could add a sched_balance_fork which you can turn
on if you like.
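
As a rough illustration of what such a switch could look like (a
userspace sketch only, not kernel code; balance_on_fork, struct cpu_load
and pick_cpu_for_new_task are hypothetical names, and "load" is
simplified to a count of runnable tasks):

```c
/* Hypothetical per-CPU load: just the number of runnable tasks. */
struct cpu_load {
	unsigned int nr_running;
};

/* Hypothetical tunable: when zero, a new task stays on its parent's CPU. */
static int balance_on_fork = 1;

/*
 * Pick a CPU for a newly created task. With balancing enabled, choose
 * the CPU with the fewest runnable tasks. The parent's CPU is only
 * abandoned for a strictly less loaded one, so ties keep the task local.
 */
static int pick_cpu_for_new_task(const struct cpu_load load[], int ncpus,
				 int parent_cpu)
{
	int best = parent_cpu;
	int cpu;

	if (!balance_on_fork)
		return parent_cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (load[cpu].nr_running < load[best].nr_running)
			best = cpu;
	}
	return best;
}
```

With the switch off, new threads stay next to the memory they are most
likely to share; with it on, they spread out early.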

Longer term some homenode scheduling affinity may be still useful,
but I tried to get that to work on 2.4 and failed, so I'm not sure
it can be done. The right way may be to keep track of how much memory
each thread allocated on each node and preferably schedule on
the node with the most memory. But that's future work.
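
The bookkeeping for that could be as simple as the following sketch
(userspace C for illustration only, not kernel code; MAX_NODES,
struct task_mem, account_alloc and preferred_node are all hypothetical
names, and frees are ignored):

```c
#define MAX_NODES 4	/* assumed small node count, as on a 4-way Opteron */

/* Hypothetical per-thread accounting: pages allocated on each node. */
struct task_mem {
	unsigned long pages[MAX_NODES];
};

/* Record that the thread allocated npages on the given node. */
static void account_alloc(struct task_mem *t, int node, unsigned long npages)
{
	if (node >= 0 && node < MAX_NODES)
		t->pages[node] += npages;
}

/*
 * Return the node holding most of the thread's memory. Ties go to the
 * lowest-numbered node; a thread with no allocations gets node 0.
 */
static int preferred_node(const struct task_mem *t)
{
	int best = 0;
	int node;

	for (node = 1; node < MAX_NODES; node++) {
		if (t->pages[node] > t->pages[best])
			best = node;
	}
	return best;
}
```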

Yeah. There is no reason why the scheduler should perform worse than
2.4 for you. We have to get to the bottom of it.

One thing you definitely want is a sched_balance_fork, is that right?
Have you been able to do any benchmarks on recent -mm kernels?

I sent the last benchmarks I did to you (including the tweaks you
suggested). All did worse than the standard scheduler. Did you change
anything significant that makes rebenchmarking useful?

Yeah thanks for those. There have been quite a few changes and fixes
to the scheduler since then, so I think it would be worth re-testing.
