Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Vincent Guittot
Date: Tue Jan 07 2020 - 11:00:48 EST


On Tue, 7 Jan 2020 at 12:56, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote:
> > On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > > > This looks weird to me because you use imbalance_pct, which is
> > > > > > meaningful only when comparing ratios, to define a number that will
> > > > > > then be compared to a number of tasks without taking into account
> > > > > > the weight of the node. So whatever the node size, 32 or 128 CPUs,
> > > > > > the imbalance_adj will be the same: 3 with the default
> > > > > > imbalance_pct of the NUMA level, which is 125 AFAICT
> > > > > >
> > > > >
> > > > > The intent in this version was to only cover the low utilisation case
> > > > > regardless of the NUMA node size. There were too many corner cases
> > > > > where the tradeoff of local memory latency versus local memory bandwidth
> > > > > cannot be quantified. See Srikar's report as an example but it's a general
> > > > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > > > running tasks where (imbalance_pct - 100) would be 1 running task and limit
> > > >
> > > > But using imbalance_pct alone doesn't mean anything.
> > >
> > > Other than figuring out "how many running tasks are required before
> > > imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
> >
> > sorry, I don't see how you deduce this from only using imbalance_pct,
> > which is mainly there to say what percentage of difference is
> > significant
> >
>
> Because if the difference is 25% then 1 CPU out of 4 active is enough
> for imbalance_pct to potentially be a factor. Anyway, the approach seems
> almost universally disliked so even if I had reasons for not scaling
> this by the group_weight, no one appears to agree with them :P
>
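[ To make that arithmetic concrete, a purely illustrative helper; the name
  and placement are hypothetical and not taken from the patch under
  discussion:

	/*
	 * With the default NUMA imbalance_pct of 125, a 25% difference first
	 * corresponds to roughly one fully active CPU once about
	 * 100 / (125 - 100) = 4 CPUs are running tasks.
	 */
	static unsigned int small_imbalance_cutoff(unsigned int imbalance_pct)
	{
		return 100 / (imbalance_pct - 100);	/* 125 -> 4 */
	}
]
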
> > > it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> > > the tradeoff of cross-domain migration is without considering the hardware.
> > >
> > > > Using similar to the below
> > > >
> > > > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> > > >
> > > > would be more meaningful
> > > >
> > >
> > > It's meaningful in some sense from a conceptual point of view, but
> > > setting the low utilisation cutoff depending on the number of CPUs in
> > > the node does not account for the local memory latency vs bandwidth
> > > tradeoff. i.e. on a small or mid-range machine the cutoff will make
> > > sense. On larger machines, the cutoff could be at the point where
> > > memory bandwidth is saturated, leading to a scenario whereby upgrading
> > > to a larger machine performs worse than the smaller machine.
> > >
> > > Much more importantly, doing what you suggest allows an imbalance
> > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > > caches. That is going to have side-effects. While I also do not account
> > > for the LLC group_weight, it's unlikely the cut-off I used would be
> > > larger than an LLC cache on a large machine.
> > >
> > > These two points are why I didn't take the group weight into account.
> > >
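[ As a rough, hypothetical illustration of the LLC concern above, using round
  numbers rather than a specific SKU:

	64 CPUs in the node * (125 - 100) / 100 = 16 CPUs of allowed imbalance
	16 CPUs / 4 cores sharing an L3 on EPYC 2 = 4 L3 caches spanned

  so a group_weight-scaled allowance can easily cross several LLCs. ]
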
> > > Now if you want, I can do what you suggest anyway as long as you are happy
> > > that the child domain weight is also taken into account to bound the
> >
> > Taking into account the child domain makes sense to me, but shouldn't we
> > take into account the number of child groups instead? This should
> > reflect the number of different LLC caches.
>
> I guess it would but why is it inherently better? The number of domains
> would yield a similar result if we assume that all the lower domains
> have equal weight, as it is simply the weight of the SD_NUMA domain
> divided by the number of child domains.

but that's not what you are doing in your proposal. You are directly
using child->span_weight, which reflects the number of CPUs in the
child and not the number of groups

you should do something like sds->busiest->span_weight /
sds->busiest->child->span_weight, which gives you an approximation of
the number of independent groups inside the busiest NUMA node from a
shared resource point of view

>
> Now, I could be missing something with asymmetric setups. I don't know
> if it's possible for child domains of a NUMA domain to have different
> sizes. I would be somewhat surprised if they did but I also do not work
> on such machines nor have I ever accessed one (to my knowledge).
>
> > IIUC your reasoning, we want to make sure that tasks will not start to
> > fight for the same resources, which in this case is the connection
> > between the LLC cache and memory
> >
>
> Yep. I don't want a case where the allowed imbalance causes the load
> balancer to have to balance between the lower domains. *Maybe* that is
> actually better in some cases but it's far from intuitive so I would
> prefer that change be a patch on its own with a big fat comment
> explaining the reasoning behind the additional complexity.
>
> > > largest possible allowed imbalance to deal with the case of a node having
> > > multiple small LLC caches. That means that some machines will be using the
> > > size of the node and some machines will use the size of an LLC. It's less
> > > predictable overall as some machines will be "special" relative to others
> > > making it harder to reproduce certain problems locally but it would take
> > > imbalance_pct into account in a way that you're happy with.
> > >
> > > Also bear in mind that whether the LLC is accounted for or not, the final
> > > result should be halved, similar to the other imbalance calculations, to
> > > avoid over- or under-balancing.
> > >
> > > > Or you could use util_avg so you would take into account whether the
> > > > tasks are short-running ones or long-running ones
> > > >
> > >
> > > util_avg can be skewed if there are big outliers. Even then, it's not
> > > a great metric for the low utilisation cutoff. Large numbers of mostly
> > > idle but running tasks would be treated similarly to small numbers of
> > > fully active tasks. It's less predictable and harder to reason about how
> >
> > Yes, but this also has the advantage of reflecting more accurately how
> > the system is used.
> > With nr_running, we consider that mostly idle and fully active tasks
> > will have exactly the same impact on memory
> >
>
> Maybe, maybe not. When there is spare capacity in the domain overall and
> we are only interested in the low utilisation case, it seems to me that
> we should consider the most obvious and understandable metric. Now, if we
> were talking about a nearly fully loaded domain or an overloaded domain
> then I would fully agree with you as balancing utilisation in that case
> becomes critical.
>
> > > load balancing behaves across a variety of workloads.
> > >
> > > Based on what you suggest, the result looks like this (build tested
> > > only)
> >
> > I'm going to give this patch a try
> >
>
> Thanks. I've queued the same patch on one machine to see what falls out.
> I don't want to tie up half my test grid until we get some sort of
> consensus.
>
> --
> Mel Gorman
> SUSE Labs