Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n

From: Peter Zijlstra
Date: Tue Sep 09 2008 - 04:25:54 EST


On Tue, 2008-09-09 at 17:59 +1000, Nick Piggin wrote:
> On Tuesday 09 September 2008 16:54, Peter Zijlstra wrote:
> > On Tue, 2008-09-09 at 16:31 +1000, Nick Piggin wrote:
> > > On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:
> > > > I've been looking at the history of that function - it started out
> > > > quite readable - but has, over the years, grown into a monstrosity.
> > >
> > > I agree it is terrible, and subsequent "features" weren't really properly
> > > written or integrated into the sched domains idea.
> > >
> > > > Then there is this whole sched_group stuff, which I intend to have a
> > > > hard look at, afaict it's unneeded and we can iterate over the
> > > > sub-domains just as well.
> > >
> > > What sub-domains? The domains-minus-groups are just a graph (in existing
> > > setup code AFAIK just a line) of cpumasks. You have to group because you
> > > want enough control for example not to pull load from an unusually busy
> > > CPU from one group if its load should actually be spread out over a
> > > smaller domain (ie. probably other CPUs within the group we're looking
> > > at).
> > >
> > > It would be nice if you could make it simpler of course, but I just don't
> > > understand you or maybe you thought of some other way to solve this or
> > > why it doesn't matter...
> >
> > Right, I get the domain stuff - that's good stuff.
> >
> > But, let me try and confuse you with ASCII-art ;-)
> >
> > Domain [0-7]
> > group [0-3] group [4-7]
> >
> > Domain [0-3]
> > group [0-1] group [2-3]
> >
> > Domain [0-1]
> > group 0 group 1
> >
> > (right hand side not drawn due to lack of space etc...)
> >
> > So we have this tree of domains, which is cool stuff. But then we have
> > these groups in there, which closely match up with the domain's child
> > domains.
>
> But it's all per-cpu, so you'd have to iterate down other CPU's child
> domains. Which may get dirtied by that CPU. So you get cacheline
> bounces.

Humm, are you saying each cpu has its own domain tree? My understanding
was that it's a global structure, e.g. given:

domain[0-1]

domain[0] domain[1]

cpu0's parent domain is the same instance as cpu1's.

> You also lose flexibility (although nobody really takes full advantage
> of it) of totally arbitrary topology on a per-cpu basis.

Afaict the only flexibility you lose is that you cannot make groups
larger/smaller than the child domain - which, given that the whole
premise of the groups' existence is that the inner-group balancing
should be done by the level below, doesn't make sense anyway.

> > So my idea was to ditch the groups and just iterate over the child
> > domains.
>
> I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
> might be a problem if you propose to traverse other CPU's domains), but
> what exactly does that gain you?

Those cacheline bounces could be mitigated by splitting sched_domain
into two parts with a cacheline aligned dummy, keeping the rarely
modified data separate from the frequently modified data.
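
Roughly what I have in mind, as a minimal userspace sketch and not the
real struct sched_domain: the struct name, the field selection and the
64-byte line size are all assumptions for illustration, and C11 alignas
stands in for the aligned dummy member.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; the real value is architecture-specific. */
#define DEMO_CACHELINE_SIZE 64

/* Hypothetical, simplified stand-in for struct sched_domain. */
struct demo_domain {
	/* Read-mostly part: set up with the topology, then only read,
	 * including by other CPUs walking this domain. */
	struct demo_domain *parent;
	struct demo_domain *child;
	unsigned long span;		/* bitmask of CPUs covered */
	unsigned int flags;

	/* Write-hot part: updated on every balance pass. Aligning the
	 * first hot field pushes it onto its own cache line, which is
	 * the job the cacheline aligned dummy would do. */
	alignas(DEMO_CACHELINE_SIZE) unsigned long last_balance;
	unsigned int balance_interval;
	unsigned int nr_balance_failed;
};

/* The hot part must start on a cache line boundary of its own. */
static_assert(offsetof(struct demo_domain, last_balance) % DEMO_CACHELINE_SIZE == 0,
	      "hot fields must start on a cache line boundary");

/* Offset at which the frequently modified fields begin. */
static size_t demo_hot_offset(void)
{
	return offsetof(struct demo_domain, last_balance);
}
```

With that layout a remote CPU reading parent/child/span never touches
the line that the balancing writes keep dirtying.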

As to the gains - a graph walk with a single type seems more elegant to
me.
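
To make the single-type walk concrete, a minimal sketch assuming a
hypothetical domain node with child and sibling links (the existing
sched_domain has no sibling pointer, and the load field here is purely
illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical domain node: one type for every level of the tree. */
struct demo_dom {
	struct demo_dom *child;		/* first child domain, NULL at a leaf */
	struct demo_dom *sibling;	/* next domain under the same parent */
	unsigned long load;		/* leaf only: load of that single CPU */
};

/* Sum the load of every leaf below (or at) domain d. */
static unsigned long demo_total_load(const struct demo_dom *d)
{
	unsigned long sum = 0;

	assert(d != NULL);
	if (!d->child)
		return d->load;
	for (const struct demo_dom *c = d->child; c; c = c->sibling)
		sum += demo_total_load(c);
	return sum;
}

/* Pick the most loaded child domain; this is the role the busiest
 * group plays today. Returns NULL when d is a leaf. */
static const struct demo_dom *demo_busiest_child(const struct demo_dom *d)
{
	const struct demo_dom *busiest = NULL;
	unsigned long max_load = 0;

	assert(d != NULL);
	for (const struct demo_dom *c = d->child; c; c = c->sibling) {
		unsigned long load = demo_total_load(c);

		if (!busiest || load > max_load) {
			busiest = c;
			max_load = load;
		}
	}
	return busiest;
}
```

The balancer would call demo_busiest_child() at each level and descend,
so there is one node type and one iteration pattern instead of a
domain walk plus a group walk.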
