Re: [PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails

From: Chen, Yu C
Date: Wed Jul 16 2025 - 11:58:59 EST


On 7/16/2025 7:25 PM, Peter Zijlstra wrote:
On Tue, Jul 15, 2025 at 06:08:43PM +0800, Chen, Yu C wrote:
On 7/15/2025 3:08 PM, kernel test robot wrote:


Hello,

kernel test robot noticed a 22.9% regression of unixbench.throughput on:


commit: ac34cb39e8aea9915ec2f4e08c979eb2ed1d7561 ("[PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails")
url: https://github.com/intel-lab-lkp/linux/commits/Chris-Mason/sched-fair-bump-sd-max_newidle_lb_cost-when-newidle-balance-fails/20250626-224805
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 5bc34be478d09c4d16009e665e020ad0fcd0deea
patch link: https://lore.kernel.org/all/20250626144017.1510594-2-clm@xxxxxx/
patch subject: [PATCH v2] sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails

testcase: unixbench
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

runtime: 300s
nr_task: 100%
test: shell1
cpufreq_governor: performance


...


commit:
5bc34be478 ("sched/core: Reorganize cgroup bandwidth control interface file writes")
ac34cb39e8 ("sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails")

5bc34be478d09c4d          ac34cb39e8aea9915ec2f4e08c9
----------------          ---------------------------
         %stddev    %change          %stddev
             \          |                \
...

     40.37           +16.9       57.24        mpstat.cpu.all.idle%

This commit inhibits the newidle balance.

Only when it is not successful: when newidle balance fails to pull
tasks, it backs off and does less of it.
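
For reference, the backoff in question is roughly of this shape
(paraphrased from the patch description, not the exact hunk;
pulled_task is the per-domain result inside sched_balance_newidle()):

	/*
	 * Sketch: when this domain's newidle balance pulled nothing,
	 * inflate its max_newidle_lb_cost so the avg_idle vs. cost
	 * comparison skips this domain more often on later newidle
	 * attempts.
	 */
	if (!pulled_task)
		sd->max_newidle_lb_cost += sd->max_newidle_lb_cost / 2;	/* the 3/2 backoff */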

It seems that some workloads do not like newidle balance, like
schbench, which runs short-duration tasks, while other workloads, like
the unixbench shell test case, want newidle balance to pull at its best
effort.
I wonder if we could check the sched domain's average utilization to
decide how hard to trigger the newidle balance, or check the
overutilized flag to decide whether to launch it at all. Something I
was thinking of:
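
(A hypothetical sketch only: the placement and the names below are mine
and untested, assuming the same spot in sched_balance_newidle() as the
backoff above; this is not meant as an actual patch.)

	/*
	 * Illustration: keep pulling at best effort while the root domain
	 * is overutilized, and only let the backoff kick in on a lightly
	 * loaded system.
	 */
	if (!pulled_task && !READ_ONCE(this_rq->rd->overutilized))
		sd->max_newidle_lb_cost += sd->max_newidle_lb_cost / 2;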

Looking at the actual util signal might be interesting, but as Chris
already noted, overutilized isn't the right thing to look at. Simply
taking rq->cfs.avg.util_avg might be more useful. Very high util and a
failure to pull might indicate newidle balance just isn't very
important / effective, while low util and a failure might mean we
should try harder.
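
Something along those lines might look like this (a rough sketch only,
again assuming the same spot in sched_balance_newidle(); the
SCHED_CAPACITY_SCALE/2 threshold is an arbitrary placeholder, not a
tuned value):

	/*
	 * Sketch: a busy rq that still failed to pull suggests newidle
	 * balance is not buying much, so back off hard; a mostly idle rq
	 * that failed to pull keeps trying at full cost.
	 */
	unsigned long util = READ_ONCE(this_rq->cfs.avg.util_avg);

	if (!pulled_task && util > SCHED_CAPACITY_SCALE / 2)
		sd->max_newidle_lb_cost += sd->max_newidle_lb_cost / 2;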

Other things to look at:

- whether the sysctl_sched_migration_cost limit is artificially capping
the actual scanning costs. E.g. very large domains might well have
costs that are genuinely larger than that somewhat arbitrary number.

- whether, despite the apparent failure to pull, we already have
something to run (e.g. from wakeups).

- whether the 3/2 backoff is perhaps too aggressive vs. the ~1% per
second decay (rough numbers below).
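
For a rough sense of scale (assuming a single 3/2 bump and the existing
~1% per second decay of max_newidle_lb_cost): one bump takes about
ln(1.5)/ln(1/0.99) ~= 40 seconds of decay to wear off, so a few failed
pulls in a row can keep a domain suppressed for quite a while.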

Thanks for the suggestions. Let me try to reproduce this issue locally
and figure out the proper way to address it.


thanks,
Chenyu