[PATCH] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

From: peter . puhov
Date: Tue Jun 16 2020 - 12:48:09 EST


From: Peter Puhov <peter.puhov@xxxxxxxxxx>

In the slow path, when selecting the idlest group, if both groups have
type group_has_spare, only the idle_cpus count is compared.
As a result, if multiple tasks are created in a tight loop
and go back to sleep immediately
(while waiting for all tasks to be created),
they may be scheduled on the same core, because the CPU is back to idle
when the next fork happens.

For example:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 61582
...
sudo perf script
sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new:
sysbench:129380 [120] success=1 CPU:007
sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new:
sysbench:129381 [120] success=1 CPU:007
sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new:
sysbench:129382 [120] success=1 CPU:007
sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new:
sysbench:129383 [120] success=1 CPU:007

This may have a negative impact on performance for workloads that
frequently create multiple threads.

This patch uses group_util to select the idlest group when both groups
have an equal number of idle_cpus. In that case newly created tasks are
better distributed. It is possible to use nr_running instead of group_util,
but the result is less predictable.
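
As an illustration only, the resulting comparison for two group_has_spare
groups can be sketched in standalone C. The struct and function names below
(spare_stats, spare_group_is_idler) are hypothetical stand-ins and not
kernel identifiers; only the idle_cpus and group_util fields and the
ordering of the checks follow the patch.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the per-group statistics. */
struct spare_stats {
	unsigned int idle_cpus;   /* number of idle CPUs in the group */
	unsigned long group_util; /* summed utilization of the group */
};

/*
 * Return true if 'cand' should replace 'idlest' as the idlest group,
 * assuming both groups are of type group_has_spare.
 */
static bool spare_group_is_idler(const struct spare_stats *idlest,
				 const struct spare_stats *cand)
{
	/* The group with more idle CPUs still wins outright. */
	if (idlest->idle_cpus > cand->idle_cpus)
		return false;

	/* On a tie, keep the current pick unless 'cand' has lower util. */
	if (idlest->idle_cpus == cand->idle_cpus &&
	    idlest->group_util <= cand->group_util)
		return false;

	return true;
}
```

With equal idle_cpus, a candidate with group_util 50 replaces a current
pick with group_util 100, while a candidate with the same group_util does
not, so the first group found is kept on a full tie.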

With this patch:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 74401
...
sudo perf script
sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new:
sysbench:129457 [120] success=1 CPU:008
sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new:
sysbench:129458 [120] success=1 CPU:009
sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new:
sysbench:129459 [120] success=1 CPU:010
sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new:
sysbench:129460 [120] success=1 CPU:011

We tested this patch with the following benchmarks:
perf bench -f simple sched pipe -l 4000000
perf bench -f simple sched messaging -l 30000
perf bench -f simple mem memset -s 3GB -l 15 -f default
perf bench -f simple futex wake -s -t 640 -w 1
sysbench cpu --threads=8 --cpu-max-prime=10000 run
sysbench memory --memory-access-mode=rnd --threads=8 run
sysbench threads --threads=8 run
sysbench mutex --mutex-num=1 --threads=8 run
hackbench --loops 20000
hackbench --pipe --threads --loops 20000
hackbench --pipe --threads --loops 20000 --datasize 4096

and found some performance improvements in:
sysbench threads
sysbench mutex
perf bench futex wake
and no regressions in others.

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'
$> sysbench threads --threads=16 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Initializing random number generator from current time
Initializing worker threads...
Threads started!
General statistics:
total time: 10.0079s
total number of events: 45526 << higher is better
Latency (ms):
min: 0.36
avg: 3.52
max: 54.22
95th percentile: 23.10
sum: 160044.33
Threads fairness:
events (avg/stddev): 2845.3750/94.18
execution time (avg/stddev): 10.0028/0.00

With patch:
$> sysbench threads --threads=16 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Initializing random number generator from current time
Initializing worker threads...
Threads started!
General statistics:
total time: 10.0053s
total number of events: 56567 << higher is better
Latency (ms):
min: 0.36
avg: 2.83
max: 27.65
95th percentile: 18.95
sum: 160003.83

Threads fairness:
events (avg/stddev): 3535.4375/147.38
execution time (avg/stddev): 10.0002/0.00

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'
$> sysbench mutex --mutex-num=1 --threads=32 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 32
Initializing random number generator from current time
Initializing worker threads...
Threads started!
General statistics:
total time: 1.0415s << lower is better
total number of events: 32
Latency (ms):
min: 940.57
avg: 959.24
max: 1041.05
95th percentile: 960.30
sum: 30695.84
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 0.9592/0.02

With patch:
$> sysbench mutex --mutex-num=1 --threads=32 run
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 32
Initializing random number generator from current time
Initializing worker threads...
Threads started!
General statistics:
total time: 0.9209s << lower is better
total number of events: 32
Latency (ms):
min: 867.37
avg: 892.09
max: 920.70
95th percentile: 909.80
sum: 28546.84
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 0.8921/0.01

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'
$> perf bench futex wake -s -t 128 -w 1
# Running 'futex/wake' benchmark:
Run summary [PID 2414]: blocking on 128 threads
(at [private] futex 0xaaaab663a154), waking up 1 at a time.
Wokeup 128 of 128 threads in 0.2852 ms (+-1.86%) << lower is better

With patch:
$> perf bench futex wake -s -t 128 -w 1
# Running 'futex/wake' benchmark:
Run summary [PID 5057]: blocking on 128 threads
(at [private] futex 0xaaaace461154), waking up 1 at a time.
Wokeup 128 of 128 threads in 0.2705 ms (+-1.84%) << lower is better

Signed-off-by: Peter Puhov <peter.puhov@xxxxxxxxxx>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..abcbdf80ee75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,

 	case group_has_spare:
 		/* Select group with most idle CPUs */
-		if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
+		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
 			return false;
+
+		/* Select group with lowest group_util */
+		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
+			idlest_sgs->group_util <= sgs->group_util)
+			return false;
+
 		break;
 	}

--
2.20.1