Re: [PATCH 1/4] sched/fair: make sure to try to detach at least one movable task

From: Vincent Guittot
Date: Wed Mar 20 2024 - 12:58:09 EST


Hi Josh,

Sorry for the late reply.

On Mon, 12 Feb 2024 at 21:29, Josh Don <joshdon@xxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> On Thu, Aug 25, 2022 at 5:27 AM Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > During load balance, we try at most env->loop_max times to move a task.
> > But it can happen that the loop_max LRU tasks (i.e. the tail of
> > the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> > In this case, loop in the list until we find at least one.
>
> We had a user recently trigger a hard lockup, which we believe is due
> to this patch. The user in question had O(10k) threads affinitized to
> a cpu; it seems the process had an out-of-control thread-spawning
> issue and was in the middle of getting killed. However, the kill was
> being slowed down by the fact that load balance was iterating all

Does this mean that it was progressing, just not as fast as you would like?

> these threads and bouncing the rq lock (and making no progress due to
> ALL_PINNED). Before this patch, load balance would quit after hitting
> loop_max.
>
> Even ignoring that specific instance, it seems pretty easy for this
> patch to cause a softlockup due to a buggy or malicious process.

The fact that the rq lock is released regularly should prevent a
softlockup. And we could even speed up can_migrate_task(), which does
a lot of useless work for tasks affined to a single CPU:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8270e2e15cb..15bc1067c69d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8920,6 +8920,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

        lockdep_assert_rq_held(env->src_rq);

+       if (p->nr_cpus_allowed == 1)
+               return 0;
        /*
         * We do not migrate tasks that are:
         * 1) throttled_lb_pair, or
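
For reference, the regular release I mean is the existing "breather":
detach_tasks() bails out every nr_migrate iterations with
LBF_NEED_BREAK set, load_balance() drops busiest's rq lock before
handling the flag (that's the second hunk quoted below), and only then
loops back. The detach_tasks() side looks roughly like this
(paraphrased from kernel/sched/fair.c; the constant name varies across
versions):

                /* take a breather every nr_migrate tasks */
                if (env->loop > env->loop_break) {
                        env->loop_break += SCHED_NR_MIGRATE_BREAK;
                        env->flags |= LBF_NEED_BREAK;
                        break;
                }

So the lock is always released between two batches, whatever
can_migrate_task() decides for each task.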

>
> For the tradeoff you were trying to make in this patch (spend more
> time searching in the hopes that there's something migratable further
> in the list), perhaps it would be better to adjust
> sysctl.sched_nr_migrate instead of baking this into the kernel?

That could be a solution, but it increases the number of iterations
for all cases, including those which are more time consuming to sort
out, and it also increases the number of tasks that you can migrate
in one load balance. The latter is what actually consumes time.
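
To be concrete, the same knob bounds both things because loop_max is
derived from the sysctl at the top of load_balance(), something like
(paraphrased from kernel/sched/fair.c):

        struct lb_env env = {
                ...
                /* caps both the search depth and the migration budget */
                .loop_max       = min(sysctl_sched_nr_migrate, busiest->nr_running),
                ...
        };

So raising sysctl.sched_nr_migrate doesn't only extend the search for
a movable task, it also lets one load balance detach and move more
tasks, and that move is the expensive part.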

Vincent

>
> Best,
> Josh
>
> >
> > The maximum number of detached tasks remains the same as before.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da388657d5ac..02b7b808e186 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> >                 p = list_last_entry(tasks, struct task_struct, se.group_node);
> >
> >                 env->loop++;
> > -               /* We've more or less seen every task there is, call it quits */
> > -               if (env->loop > env->loop_max)
> > +               /*
> > +                * We've more or less seen every task there is, call it quits
> > +                * unless we haven't found any movable task yet.
> > +                */
> > +               if (env->loop > env->loop_max &&
> > +                   !(env->flags & LBF_ALL_PINNED))
> >                         break;
> >
> >                 /* take a breather every nr_migrate tasks */
> > @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >
> >                 if (env.flags & LBF_NEED_BREAK) {
> >                         env.flags &= ~LBF_NEED_BREAK;
> > -                       goto more_balance;
> > +                       /* Stop if we tried all running tasks */
> > +                       if (env.loop < busiest->nr_running)
> > +                               goto more_balance;
> >                 }
> >
> >                 /*
> > --
> > 2.17.1
> >