Re: Plumbers: Tweaking scheduler policy micro-conf RFP

From: Vincent Guittot
Date: Wed May 16 2012 - 17:21:04 EST


On 15 May 2012 17:35, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
>> On 15 May 2012 15:00, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>> >>
>> >> I'm not sure that nobody cares; it's more that the scheduler,
>> >> load_balance and sched_mc are sensitive enough that it's difficult to
>> >> ensure that a modification will not break everything for someone
>> >> else.
>> >
>> > Thing is, it's already broken, there's nothing else to break :-)
>> >
>>
>> sched_mc is the only power-aware knob in the current scheduler. It's
>> far from perfect but it seems to work on some ARM platforms at
>> least. You mentioned at the scheduler mini-summit that we need a
>> cleaner replacement and everybody agreed on that point. Is anybody
>> working on it yet?
>
> Apparently not..
>
>> And can we discuss at Plumbers what this replacement would look like?
>
> one knob: sched_balance_policy with tri-state {performance, power, auto}
>
> Where auto should likely look at things like are we on battery and
> co-ordinate with cpufreq muck or whatever.

IIUC, performance and power will be platform- and architecture-agnostic
and will rely only on a "simple" CPU topology description, while auto
mode would exchange information with frameworks like cpufreq, which can
provide platform-specific information such as a clock rate
dependency.
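
A minimal sketch of the single-knob idea, to make the discussion
concrete. Only the knob name sched_balance_policy and its three states
come from this thread; the enumerator names, struct platform_hint and
resolve_policy() are hypothetical, and the "on battery" test merely
stands in for whatever auto mode would really consult (cpufreq or
similar).

```c
#include <stdbool.h>

/* Hypothetical names; only the knob and its three states are from the thread. */
enum sched_balance_policy {
	SCHED_POLICY_PERFORMANCE,
	SCHED_POLICY_POWER,
	SCHED_POLICY_AUTO,
};

/* Assumed platform hint that 'auto' might look at (not an existing kernel type). */
struct platform_hint {
	bool on_battery;
};

/*
 * Resolve the knob to one of the two concrete behaviours.
 * 'performance' and 'power' pass through untouched; 'auto' picks
 * one of them from the platform hint.
 */
enum sched_balance_policy
resolve_policy(enum sched_balance_policy knob, const struct platform_hint *hint)
{
	if (knob != SCHED_POLICY_AUTO)
		return knob;

	return hint->on_battery ? SCHED_POLICY_POWER
				: SCHED_POLICY_PERFORMANCE;
}
```

With this shape the balancer itself only ever sees two policies; any
platform-specific knowledge stays behind the auto resolution step.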

>
> Per domain knobs are insane, large multi-state knobs are insane, the
> existing scheme is therefore insane^2. Can you find a sysad who'd like
> to explore 3^3=27 states for optimal power/perf for his workload on a
> simple 2 socket hyper-threaded machine and 3^4=81 state space for 8
> sockets etc..?
>
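
The state-space figures above are just states^levels. A small sketch
of that arithmetic, assuming one independent knob per domain level;
the helper name knob_state_space() is made up for illustration.

```c
/*
 * Number of distinct configurations when each of 'levels' domain
 * levels carries its own knob with 'states' possible values.
 */
unsigned long knob_state_space(unsigned int states, unsigned int levels)
{
	unsigned long total = 1;

	while (levels--)
		total *= states;

	return total;
}
```

knob_state_space(3, 3) gives the 27 quoted for the 2-socket case and
knob_state_space(3, 4) gives 81, while a single tri-state knob
(levels = 1) leaves only 3 settings to explore.
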
> As to the exact policy, I think the current 2 (load-balance + wakeup) is
> the sensible one..
>
> Also, I still have this pending email from you asking about the topology
> setup stuff I really need to reply to.. but people keep sending me bug
> reports :/
>

I'd be interested in your feedback when you have time.

> But really short, look at kernel/sched/core.c:default_topology[]
>
> I'd like to get rid of sd_init_* into a single function like
> sd_numa_init(), this would mean all that archs would need to do is
> provide a simple list of ever increasing masks that match their topology.
>
> To aid this we can add some SDTL_flags, initially I was thinking of:
>
>  SDTL_SHARE_CORE        -- aka SMT
>  SDTL_SHARE_CACHE       -- LLC cache domain (typically multi-core)
>  SDTL_SHARE_MEMORY      -- NUMA-node (typically socket)
>
> The 'performance' policy is typically to spread over shared resources so
> as to minimize contention on these.
>
> If you want to add some power we need some extra flags, maybe something
> like:
>
>  SDTL_SHARE_POWERLINE   -- power domain (typically socket)
>
> so you know where the boundaries are where you can turn stuff off so you
> know what/where to pack bits.

I'm not sure I see how this flag would be used compared to the others.
The first three SDTL_SHARE_XXX topology flags are mutually exclusive and
describe different levels of the CPU topology, but SDTL_SHARE_POWERLINE
could be set at each level to describe whether or not the CPUs in the
sched_domain share a power domain.
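
To illustrate the point, here is a rough sketch of a per-arch topology
list in which SDTL_SHARE_POWERLINE is combined with the level flags
rather than being a level of its own. The flag names are the ones
proposed in this thread; the bit values, struct topo_level, the 8-CPU
example table and both helpers are assumptions made up for the
example, not existing kernel code.

```c
#include <stddef.h>

/* Flag names from the thread; the bit values are assumptions. */
#define SDTL_SHARE_CORE       0x01	/* SMT */
#define SDTL_SHARE_CACHE      0x02	/* LLC domain */
#define SDTL_SHARE_MEMORY     0x04	/* NUMA node */
#define SDTL_SHARE_POWERLINE  0x08	/* CPUs share a power domain */

/* Hypothetical description of one topology level. */
struct topo_level {
	unsigned long cpu_mask;	/* toy bitmask of CPUs spanned, seen from CPU 0 */
	unsigned int flags;
	const char *name;
};

/*
 * Made-up 8-CPU machine: SMT pair {0,1}, cache domain {0-3},
 * node {0-7}. The power line is shared up to the cache domain.
 */
static const struct topo_level example_topology[] = {
	{ 0x03, SDTL_SHARE_CORE   | SDTL_SHARE_POWERLINE, "SMT"  },
	{ 0x0f, SDTL_SHARE_CACHE  | SDTL_SHARE_POWERLINE, "MC"   },
	{ 0xff, SDTL_SHARE_MEMORY,                        "NODE" },
};

/* Check the "ever increasing masks" rule: each level strictly contains the previous one. */
int masks_are_nested(const struct topo_level *tl, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if ((tl[i - 1].cpu_mask & ~tl[i].cpu_mask) ||
		    tl[i - 1].cpu_mask == tl[i].cpu_mask)
			return 0;
	}
	return 1;
}

/* Index of the widest level whose CPUs still share a power domain, or -1 if none. */
int widest_power_domain(const struct topo_level *tl, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++)
		if (tl[i].flags & SDTL_SHARE_POWERLINE)
			best = (int)i;

	return best;
}
```

For the example table, widest_power_domain() returns 1 (the "MC"
level), which would be the boundary a power policy packs within before
spilling tasks into the next domain.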

>
> Possibly we also add something like:
>
>  SDTL_PERF_SPREAD       -- spread on performance mode
>  SDTL_POWER_PACK        -- pack on power mode
>
> To over-ride the defaults. But ideally I'd leave those until after we've
> got the basics working and there is a clear need for them (with a
> spread/pack default for perf/power aware).