Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Mel Gorman
Date: Thu May 21 2020 - 06:38:27 EST


On Fri, May 15, 2020 at 04:24:44PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> > On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> >
> > > However, the wakeups are so rapid that the wakeup
> > > happens while the server is descheduling. That forces the waker to spin
> > > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > > add the task to the rq->wake_list even if that potentially requires an IPI.
> >
> > Right, I think Rik ran into that as well at some point. He wanted to
> > make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> > cpu (which is currently in the middle of schedule()) should be an easier
> > proposition.
> >
> > Maybe something like this untested thing... could explode most mighty,
> > didn't thing too hard.
> >
>
> Mel pointed out that that patch got mutilated somewhere (my own .Sent
> copy was fine), let me try again.
>

Sorry for the slow response. My primary work machine suffered a
catastrophic failure on Sunday night which is a fantastic way to start
a Monday morning so I'm playing catchup.

IIUC, this patch front-loads as much work as possible before checking if
the task is on_rq and then the waker/wakee shares a cache, queue task on
the wake_list and otherwise do a direct wakeup.

The advantage is that spinning is avoided on p->on_rq when p does not
share a cache. The disadvantage is that it may result in tasks being
stacked but this should only happen when the domain is overloaded and
select_task_eq() is unlikely to find an idle CPU. The load balancer would
soon correct the situation anyway.

In terms of netperf for my testing, the benefit is marginal because the
wakeups are primarily between tasks that share cache. It does trigger as
perf indicates that some time is spent in ttwu_queue_remote with this
patch, it's just that the overall time spent spinning on p->on_rq is
very similar. I'm still waiting on other workloads to complete to see
what the impact is.

However, intuitively at least, it makes sense to avoid spinning on p->on_rq
when it's unnecessary and the other changes appear to be safe. Even if
wake_list should be used in some cases for local wakeups, it would make
sense to put that on top of this patch. Do you want to slap a changelog
around this and update the comments or do you want me to do it? I should
have more results in a few hours even if they are limited to one machine
but ideally Rik would test his workload too.

--
Mel Gorman
SUSE Labs