Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n

From: Nick Piggin
Date: Tue Sep 09 2008 - 03:59:48 EST


On Tuesday 09 September 2008 16:54, Peter Zijlstra wrote:
> On Tue, 2008-09-09 at 16:31 +1000, Nick Piggin wrote:
> > On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:
> > > I've been looking at the history of that function - it started out
> > > quite readable - but has, over the years, grown into a monstrosity.
> >
> > I agree it is terrible, and subsequent "features" weren't really properly
> > written or integrated into the sched domains idea.
> >
> > > Then there is this whole sched_group stuff, which I intend to have a
> > > hard look at, afaict it's unneeded and we can iterate over the
> > > sub-domains just as well.
> >
> > What sub-domains? The domains-minus-groups are just a graph (in existing
> > setup code AFAIK just a line) of cpumasks. You have to group because you
> > want enough control for example not to pull load from an unusually busy
> > CPU from one group if its load should actually be spread out over a
> > smaller domain (ie. probably other CPUs within the group we're looking
> > at).
> >
> > It would be nice if you could make it simpler of course, but I just don't
> > understand you or maybe you thought of some other way to solve this or
> > why it doesn't matter...
>
> Right, I get the domain stuff - that's good stuff.
>
> But, let me try and confuse you with ASCII-art ;-)
>
> Domain [0-7]
> group [0-3] group [4-7]
>
> Domain [0-3]
> group [0-1] group [2-3]
>
> Domain [0-1]
> group 0 group 1
>
> (right hand side not drawn due to lack of space etc...)
>
> So we have this tree of domains, which is cool stuff. But then we have
> these groups in there, which closely match up with the domain's child
> domains.

But it's all per-cpu, so you'd have to iterate down other CPU's child
domains. Which may get dirtied by that CPU. So you get cacheline
bounces.

You also lose flexibility (although nobody really takes full advantage
of it) of totally arbitrary topology on a per-cpu basis.
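
To make the per-cpu point concrete, here is a minimal userspace sketch of
the layout under discussion. The toy_* names, the bitmask cpumasks and the
flat load array are assumptions made for illustration only; they are loosely
modelled on the domain/group idea and are not the kernel's actual
definitions. Each CPU owns its own chain of domains, and each domain carries
its own circular list of groups, so a balancing pass reads only data that
belongs to the CPU doing the balancing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct toy_group {
    unsigned long cpumask;      /* CPUs covered by this group */
    struct toy_group *next;     /* circular list within one domain */
};

struct toy_domain {
    unsigned long span;         /* CPUs covered by this domain */
    struct toy_domain *parent;  /* next larger domain, NULL at the top */
    struct toy_group *groups;   /* group list private to the owning CPU */
};

/* Sum the per-CPU load of every CPU set in mask. */
static unsigned long mask_load(unsigned long mask,
                               const unsigned long *cpu_load, int nr_cpus)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (mask & (1UL << cpu))
            sum += cpu_load[cpu];
    return sum;
}

/* Walk one domain's circular group list once; return the most loaded group. */
static struct toy_group *busiest_group(const struct toy_domain *sd,
                                       const unsigned long *cpu_load,
                                       int nr_cpus)
{
    assert(sd != NULL && sd->groups != NULL);

    struct toy_group *g = sd->groups;
    struct toy_group *busiest = g;
    unsigned long max = mask_load(g->cpumask, cpu_load, nr_cpus);

    for (g = g->next; g != sd->groups; g = g->next) {
        unsigned long load = mask_load(g->cpumask, cpu_load, nr_cpus);

        if (load > max) {
            max = load;
            busiest = g;
        }
    }
    return busiest;
}
```

For the Domain [0-3] example quoted above, the group list would hold the
masks 0x3 (CPUs 0-1) and 0xc (CPUs 2-3). Replacing that list with a walk
over child domains would mean following pointers into structures owned by
other CPUs, which is where the bouncing would come from.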


> So my idea was to ditch the groups and just iterate over the child
> domains.

I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
might be a problem if you propose to traverse other CPU's domains), but
what exactly does that gain you?


> > > Finally, we should move all this stuff into sched_fair and get rid of
> > > that iterator interface and fix up all nr_running etc.. usages to refer
> > > to cfs.nr_running and similar.
> > >
> > > Then there is the idea Andi proposed, splitting up the performance and
> > > power balancer into two separate functions, something that is worth
> > > looking into imho.
> >
> > That's what *I* suggested. Before it even went in. Of course there was no
> > attempt made at all and it went in despite my reservations, but what's
> > new
> >
> > :)
>
> Even more reason to make it happen.

Yes it would be great if it happens.