Re: Plumbers: Tweaking scheduler policy micro-conf RFP

From: Peter Zijlstra
Date: Tue May 15 2012 - 11:36:08 EST


On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
> On 15 May 2012 15:00, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> >>
> >> Not sure that nobody cares, but it's much more that the scheduler,
> >> load_balance and sched_mc are sensitive enough that it's difficult to
> >> ensure that a modification will not break everything for someone
> >> else.
> >
> > Thing is, it's already broken, there's nothing else to break :-)
> >
>
> sched_mc is the only power-aware knob in the current scheduler. It's
> far from perfect, but it seems to work on some ARM platforms at
> least. You mentioned at the scheduler mini-summit that we need a
> cleaner replacement, and everybody agreed on that point. Is anybody
> working on it yet?

Apparently not..

> And can we discuss at Plumbers what this replacement would look like?

one knob: sched_balance_policy with tri-state {performance, power, auto}

Where auto should likely look at things like whether we're on battery,
and co-ordinate with cpufreq muck or whatever.
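
As a rough illustration of the single-knob idea, here is a minimal
userspace sketch of a tri-state setting where auto resolves to one of the
other two. The enum names, effective_policy() and the on_battery hint are
all hypothetical; nothing here is existing kernel code, and a real auto
mode would also have to co-ordinate with cpufreq.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state policy; the names are illustrative only. */
enum sched_balance_policy {
	SCHED_POLICY_PERFORMANCE,
	SCHED_POLICY_POWER,
	SCHED_POLICY_AUTO,
};

/*
 * Resolve the user-visible setting to the policy the balancer would act
 * on. 'on_battery' stands in for whatever platform hint is available.
 */
static enum sched_balance_policy
effective_policy(enum sched_balance_policy set, bool on_battery)
{
	assert((int)set >= 0 && (int)set <= SCHED_POLICY_AUTO);

	/* An explicit choice always wins. */
	if (set != SCHED_POLICY_AUTO)
		return set;

	/* Assumption: auto means power on battery, performance otherwise. */
	return on_battery ? SCHED_POLICY_POWER : SCHED_POLICY_PERFORMANCE;
}
```

The point of the sketch is that the balancer itself only ever sees two
modes; auto is resolved once, outside the balancing paths.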

Per-domain knobs are insane, large multi-state knobs are insane, the
existing scheme is therefore insane^2. Can you find a sysadmin who'd like
to explore a 3^3=27 state space for optimal power/perf for his workload
on a simple 2-socket hyper-threaded machine, and a 3^4=81 state space for
8 sockets, etc.?
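
The state-space numbers above are the number of knob settings raised to
the number of independently tunable domain levels. A small sketch of that
arithmetic, assuming three settings per knob and reading the exponent as
the level count (KNOB_STATES and knob_state_space() are made-up names):

```c
#include <assert.h>

/* Assumption: every per-domain knob has three settings. */
#define KNOB_STATES 3UL

/* Count the combinations an admin faces with one knob per domain level. */
static unsigned long knob_state_space(unsigned int levels)
{
	unsigned long states = 1;

	/* 3^20 still fits in a 32-bit unsigned long; refuse anything larger. */
	assert(levels <= 20);

	while (levels--)
		states *= KNOB_STATES;

	return states;
}
```

With three levels this gives 27 combinations, and adding a fourth level
gives 81, which is the growth the single knob is meant to avoid.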

As to the exact policy, I think the current 2 (load-balance + wakeup) is
the sensible one..

Also, I still have this pending email from you asking about the topology
setup stuff that I really need to reply to.. but people keep sending me
bug reports :/

But in short: look at kernel/sched/core.c:default_topology[]

I'd like to fold sd_init_* into a single function like sd_numa_init();
this would mean all that archs would need to do is provide a simple list
of ever-increasing masks that match their topology.
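
To make the "list of ever increasing masks" idea concrete, a minimal
userspace sketch follows. It is not the layout of the kernel's
default_topology[]; toy_cpumask_t, struct toy_topology_level and
topology_is_nested() are hypothetical, and a 64-bit word stands in for a
real cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t toy_cpumask_t;

/* Hypothetical per-level entry: the span of this level around one CPU. */
struct toy_topology_level {
	const char *name;
	toy_cpumask_t span;
};

/* True if every level's mask contains the mask of the level below it. */
static bool topology_is_nested(const struct toy_topology_level *tl, size_t n)
{
	assert(tl != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		/* A CPU present below but missing above breaks the nesting. */
		if (tl[i - 1].span & ~tl[i].span)
			return false;
	}
	return true;
}

/* As seen from CPU 0: 2 threads per core, 4 cores per socket, 2 sockets. */
static const struct toy_topology_level example_topology[] = {
	{ "SMT",    0x0003ULL },	/* CPUs 0-1  */
	{ "MC",     0x00ffULL },	/* CPUs 0-7  */
	{ "socket", 0xffffULL },	/* CPUs 0-15 */
};
```

The nesting check is the only property such a list would need to satisfy
in this sketch; everything else would be derived by generic code.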

To aid this we can add some SDTL_ flags; initially I was thinking of:

SDTL_SHARE_CORE -- aka SMT
SDTL_SHARE_CACHE -- LLC cache domain (typically multi-core)
SDTL_SHARE_MEMORY -- NUMA-node (typically socket)

The 'performance' policy is typically to spread over shared resources so
as to minimize contention on these.
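
A minimal sketch of what "spread" could mean at one level, assuming the
level has already been split into groups that each share the flagged
resource. struct toy_group and pick_group_spread() are hypothetical names,
and tasks-per-CPU is only one possible measure of contention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one group of CPUs sharing a resource (core, cache, node). */
struct toy_group {
	unsigned int nr_cpus;		/* CPUs in the group */
	unsigned int nr_running;	/* runnable tasks in the group */
};

/* Spread: return the index of the group with the fewest tasks per CPU. */
static size_t pick_group_spread(const struct toy_group *g, size_t n)
{
	size_t best = 0;

	assert(g != NULL && n > 0);

	for (size_t i = 0; i < n; i++) {
		assert(g[i].nr_cpus > 0);
		/* Compare nr_running / nr_cpus by cross-multiplying. */
		if ((unsigned long long)g[i].nr_running * g[best].nr_cpus <
		    (unsigned long long)g[best].nr_running * g[i].nr_cpus)
			best = i;
	}
	return best;
}
```

On a tie the first group wins, which keeps the choice stable between
calls.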

If you want to add some power we need some extra flags, maybe something
like:

SDTL_SHARE_POWERLINE -- power domain (typically socket)

so that you know where the boundaries are at which you can turn stuff
off, and hence what to pack and where.
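
A minimal sketch of packing along such power-domain boundaries: prefer
the busiest domain that still has an idle CPU, so that the remaining
domains stay empty and can be turned off. struct toy_power_domain and
pick_domain_pack() are hypothetical, and counting busy CPUs is an
assumption about how load would be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one power domain, i.e. CPUs that power down together. */
struct toy_power_domain {
	unsigned int nr_cpus;	/* CPUs in the domain */
	unsigned int nr_busy;	/* CPUs currently running something */
};

/*
 * Pack: return the index of the busiest domain that still has room,
 * or n if every domain is full.
 */
static size_t pick_domain_pack(const struct toy_power_domain *d, size_t n)
{
	size_t best = n;

	assert(d != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (d[i].nr_busy >= d[i].nr_cpus)
			continue;	/* no idle CPU left here */
		if (best == n || d[i].nr_busy > d[best].nr_busy)
			best = i;
	}
	return best;
}
```

A return value of n signals that nothing can be packed any further and a
caller would have to fall back to waking another domain or queueing.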

Possibly we also add something like:

SDTL_PERF_SPREAD -- spread on performance mode
SDTL_POWER_PACK -- pack on power mode

To override the defaults. But ideally I'd leave those until after we've
got the basics working and there is a clear need for them (with a
spread/pack default for perf/power aware).