Re: [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path

From: Morten Rasmussen
Date: Thu Jun 23 2016 - 05:55:11 EST


On Wed, Jun 22, 2016 at 02:04:11PM -0400, Rik van Riel wrote:
> On Wed, 2016-06-22 at 18:03 +0100, Morten Rasmussen wrote:
> > In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
> > select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
> > in special cases (NUMA task swapping). However, the
> > select_task_rq_fair() helper functions: wake_affine() and
> > select_idle_sibling(), still use task_cpu(p) directly to work out
> > prev_cpu which leads to inconsistencies.
> >
> > This patch passes prev_cpu (potentially overridden by NUMA code) into
> > the helper functions to ensure prev_cpu is indeed the same cpu
> > everywhere in the wakeup path.
> >
> > cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > cc: Rik van Riel <riel@xxxxxxxxxx>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> >  kernel/sched/fair.c | 24 +++++++++++++-----------
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c6dd8bab010c..eec8e29104f9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > -static int select_idle_sibling(struct task_struct *p, int cpu);
> > +static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
> >  static unsigned long task_h_load(struct task_struct *p);
> >  
> >  /*
> > @@ -1483,7 +1483,8 @@ static void task_numa_compare(struct task_numa_env *env,
> >    * Call select_idle_sibling to maybe find a better one.
> >    */
> >   if (!cur)
> > - env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
> > + env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
> > +    env->dst_cpu);
>
> It is worth remembering that "prev" will only
> ever be returned by select_idle_sibling() if
> it is part of the same NUMA node as target.
>
> That means this patch does not change behaviour
> of the NUMA balancing code, since that always
> migrates between nodes.
>
> Now let's look at try_to_wake_up(). It will pass
> p->wake_cpu as the argument for "prev_cpu", which
> again appears to be the same CPU number as that used
> by the current code.

IIUC, p->wake_cpu != task_cpu(p) if task_numa_migrate() decided to call
migrate_swap() on the task while it was sleeping, intending it to swap
places with a task on a different NUMA node when it wakes up. Using
p->wake_cpu as "prev_cpu" in select_idle_sibling() when called through
try_to_wake_up()->select_task_rq() should only make a difference if
p->wake_cpu happens to share cache with the target cpu and is idle:

	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;
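
To make the condition concrete, here is a minimal userspace sketch of
that check. It is not kernel code: struct toy_cpu, the llc_id field and
the toy_*() helpers are made up for illustration and only stand in for
cpus_share_cache() and idle_cpu().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical toy topology: one cache-domain id and idle flag per cpu. */
struct toy_cpu {
	int  llc_id;	/* cpus with the same id share a last-level cache */
	bool idle;
};

/* Stand-in for cpus_share_cache(); an assumption, not the kernel helper. */
static bool toy_cpus_share_cache(const struct toy_cpu *cpus, int a, int b)
{
	return cpus[a].llc_id == cpus[b].llc_id;
}

/*
 * Models only the "prev" shortcut quoted above: return prev if it is a
 * different, cache-sharing, idle cpu; otherwise fall back to target.
 * The real select_idle_sibling() goes on to search for other idle cpus.
 */
static int toy_prev_shortcut(const struct toy_cpu *cpus, int prev, int target)
{
	assert(prev >= 0 && prev < NR_CPUS);
	assert(target >= 0 && target < NR_CPUS);

	if (prev != target && toy_cpus_share_cache(cpus, prev, target) &&
	    cpus[prev].idle)
		return prev;

	return target;
}
```

With cpus 0-3 in one cache domain and cpus 4-7 in another, a task whose
task_cpu(p) is an idle cpu 1 but whose p->wake_cpu is cpu 4 gives
different answers for target 0: passing cpu 1 as prev returns cpu 1,
while passing cpu 4 returns the target, since cpu 4 does not share cache
with it.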

The selection of the target cpu for select_idle_sibling() is also
slightly affected as wake_affine() currently compares task_cpu(p) and
smp_processor_id(), and then picks p->wake_cpu or smp_processor_id()
depending on the outcome. With this patch, wake_affine() uses
p->wake_cpu instead of task_cpu(p), so we actually compare the
candidates we choose between.
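
A deliberately oversimplified sketch of that mismatch, assuming
wake_affine() boils down to a plain load comparison (it does not; the
real function weighs considerably more, and select_task_rq_fair() has
further conditions around the call). All toy_*() names and the load[]
array are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/*
 * Hypothetical stand-in for wake_affine(): true means "pull the task to
 * this_cpu". Reduced to a bare load comparison for illustration only.
 */
static bool toy_wake_affine(const unsigned long *load, int prev_cpu,
			    int this_cpu)
{
	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	return load[this_cpu] <= load[prev_cpu];
}

/* Current behaviour: compare against task_cpu, but choose wake_cpu. */
static int toy_pick_old(const unsigned long *load, int task_cpu,
			int wake_cpu, int this_cpu)
{
	return toy_wake_affine(load, task_cpu, this_cpu) ? this_cpu : wake_cpu;
}

/* With the patch: compare against the same cpu we may end up choosing. */
static int toy_pick_new(const unsigned long *load, int wake_cpu,
			int this_cpu)
{
	return toy_wake_affine(load, wake_cpu, this_cpu) ? this_cpu : wake_cpu;
}
```

For example, with loads {10, 50, 5, 0}, this_cpu 0, task_cpu(p) 1 and
p->wake_cpu 2, the old variant picks cpu 0 because it compares against
the busy cpu 1, whereas the new variant keeps cpu 2 because that is the
cpu it actually compared against.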

I think that would lead to some minor changes in behaviour in a few
corner cases, but I mainly wrote the patch as I thought it was very
confusing that we could have different "prev_cpu"s in different parts of
the select_task_rq_fair() code path.

>
> I have no objection to your patch, but must be
> overlooking something, since I cannot find a change
> in behaviour that your patch would create.

Thanks for confirming that it shouldn't change anything for NUMA load
balancing. That is what I was hoping for :-)