Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann
Date: Mon May 15 2017 - 10:56:27 EST


On 12/05/17 21:57, Jeffrey Hugo wrote:
> On 5/12/2017 2:47 PM, Peter Zijlstra wrote:
>> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d711093..8f783ba 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct
>>> rq *this_rq,
>>> /* All tasks on this runqueue were pinned by CPU affinity */
>>> if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>> + struct cpumask tmp;
>>
>> You cannot have cpumask's on stack.
>
> Well, we need a temp variable to store the intermediate values since the
> cpumask_* operations are somewhat limited, and require a "storage"
> parameter.
>
> Do you have any suggestions to meet all of these requirements?

How about using env.dst_grpmask and checking whether cpus is an improper
subset of env.dst_grpmask? In that case we have to stop setting
env.dst_grpmask = NULL for CPU_NEWLY_IDLE, which is IMHO not an issue
since idle is passed via env into can_migrate_task().
And cpus has to be and'ed with sched_domain_span(env.sd).

I'm not sure if this will work with 'not fully connected NUMA' (SD_OVERLAP)
though ...
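
To make the changed redo condition concrete, here is a minimal user-space
sketch. It is not kernel code: toy_mask_t, toy_subset(), toy_initial_cpus(),
redo_old() and redo_new() are made-up names, and struct cpumask is modelled
as a plain 64-bit bitmap. The only thing taken from the kernel is the
meaning of cpumask_subset(a, b), i.e. "every CPU in a is also in b".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: bit n set means CPU n is in the mask. */
typedef uint64_t toy_mask_t;

/* Models cpumask_subset(a, b): true if every CPU in a is also in b. */
static inline bool toy_subset(toy_mask_t a, toy_mask_t b)
{
	return (a & ~b) == 0;
}

/* Models cpumask_and(cpus, cpu_active_mask, sched_domain_span(env.sd)). */
static inline toy_mask_t toy_initial_cpus(toy_mask_t active, toy_mask_t sd_span)
{
	return active & sd_span;
}

/* Current check: redo as long as any candidate CPU is left in cpus. */
static inline bool redo_old(toy_mask_t cpus)
{
	return cpus != 0;
}

/* Suggested check: redo only if a candidate outside dst's own group is left. */
static inline bool redo_new(toy_mask_t cpus, toy_mask_t dst_grpmask)
{
	return !toy_subset(cpus, dst_grpmask);
}
```

Take a domain spanning CPUs 0-3 with dst group {0,1}. Once CPUs 2 and 3
have been cleared from cpus, cpus is 0x3: redo_old(0x3) still asks for
another pass, while redo_new(0x3, 0x3) does not, because only CPUs of the
dst group remain. Without the and with the domain span, cpus would keep
CPUs from outside the domain and the subset test could never become true.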

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276fcb62..2ede4c1c9db8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
+ * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+ * already computed one in current iteration.
*/
- if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
return 0;

/* Prevent to re-select dst_cpu via env's cpus */
@@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.tasks = LIST_HEAD_INIT(env.tasks),
};

- /*
- * For NEWLY_IDLE load_balancing, we don't need to consider
- * other cpus in our group
- */
- if (idle == CPU_NEWLY_IDLE)
- env.dst_grpmask = NULL;
-
- cpumask_copy(cpus, cpu_active_mask);
+ cpumask_and(cpus, cpu_active_mask, sched_domain_span(env.sd));

schedstat_inc(sd->lb_count[idle]);

@@ -8220,7 +8213,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
+ if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;