Re: [PATCH] sched/fair: do not scan twice in detach_tasks()

From: Shijie Huang
Date: Wed Jul 16 2025 - 22:57:25 EST



On 2025/7/16 23:08, Valentin Schneider wrote:
> On 16/07/25 10:13, Shijie Huang wrote:
>> On 2025/7/16 1:04, Valentin Schneider wrote:
>>> On 07/07/25 16:36, Huang Shijie wrote:
>>>> When detach_tasks() scans the src_cpu's task list, the task list
>>>> may shrink during the scanning. For example, the task list
>>>> may have four tasks at the beginning, but shrink to two
>>>> during the scanning in detach_tasks():
>>>> Task list at the beginning: "ABCD"
>>>> Task list during scanning : "CD"
>>>>
>>>> ("ABCD" stands for different tasks.)
>>>>
>>>> In this scenario, env->loop_max is still four, so
>>>> detach_tasks() may scan some tasks twice:
>>>> the scanning order may be: "DCDC"

>>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>>> QEMU setup.
>>>
>>> Have you seen this happen on your workloads, or did you find this while
>>> staring at the code?
>> I found this issue in my Specjbb2015 test. It is very easy to trigger.

> That would be good to include in the changelog.
Okay.
>> I noticed it many times in the test.
>>
>> I even found that:
>>
>>         At the beginning: env->loop_max is 12.
>>
>>         When detach_tasks() scans: the real task list has shrunk to 5.

> That's set using rq->nr_running, which includes more than just the CFS
> tasks, and looking at the git history it looks like it's almost always been
> the case.
>
> Looks like env->loop_max really is only used by detach_tasks(), so perhaps
> a "better" fix would be something like the below, so that we can't iterate
> more than length(env->src_rq->cfs_tasks) times. That is, assuming
>
> rq->cfs.h_nr_queued == length(rq->cfs_tasks)

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..32ae24aa377ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  	 * still unbalanced. ld_moved simply stays zero, so it is
>  	 * correctly treated as an imbalance.
>  	 */
> -	env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +	env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
>
> more_balance:
> 	rq_lock_irqsave(busiest, &rf);
I tested this patch, but it did not work: I can still catch lots of
occurrences of this issue in the Specjbb test.

IMHO, the root cause of this issue is that env.loop_max is set outside
the rq's lock.

Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
list can still shrink elsewhere before detach_tasks() runs.

Btw, the real task list can also grow bigger than env.loop_max.
For example, we may set env.loop_max to 5, and by the time detach_tasks()
runs the real task list has grown to 7 tasks.
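A sketch of what I mean (illustrative and untested, not a real patch): the clamp would have to happen after the rq lock is taken, e.g. by re-reading the queue length at the top of the retry loop:

```diff
 more_balance:
 	rq_lock_irqsave(busiest, &rf);
+	/*
+	 * Re-read under the lock: the CFS task list may have shrunk or
+	 * grown since env.loop_max was computed from a stale snapshot.
+	 */
+	env.loop_max = min(env.loop_max, busiest->cfs.h_nr_queued);
```

Since detach_tasks() itself runs with this lock held, a value clamped under the same lock should stay accurate for the duration of the scan.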


Thanks

Huang Shijie
