[RFC PATCH] sched/eevdf: Use customized slice to reduce wakeup latency and inhibit task migration

From: Chen Yu
Date: Tue May 07 2024 - 10:36:29 EST


Problem 1:
Task migration is expensive on many-core systems. The overhead brings a
performance penalty due to broken cache locality and higher
cache-to-cache (C2C) latency.

Problem 2:
During wakeup, the time spent searching for an idle CPU is costly on
many-core systems. Besides, accessing another CPU's rq statistics causes
cache contention:

available_idle_cpu(cpu) -> idle_cpu(cpu) -> {rq->curr, rq->nr_running}
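
For reference, idle_cpu() is roughly the sketch below (simplified from
the mainline implementation; the exact checks vary by kernel version).
Every probe loads another CPU's runqueue fields, which is where the
cache contention comes from:

static int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* each load below can pull in a remote rq cacheline */
	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

	if (rq->ttwu_pending)
		return 0;

	return 1;
}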

Although SIS_UTIL throttles the scan depth based on system utilization,
there is a need to further limit the scan depth for specific workloads,
especially for short-duration wakees.

Now we have an interface to customize the request/slice. The smaller
the slice is, the earlier the task can be picked up, and the lower the
wakeup latency the task expects. Leverage the wakee's slice to further
throttle the idle CPU scan depth: the shorter the slice, the fewer CPUs
to scan.
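
As a concrete illustration (numbers assumed for the example): if
SIS_UTIL leaves nr_idle_scan at 60 on a mostly idle 60-CPU LLC, a wakee
with the 0.1 msec slice used below and the default 3 msec base slice has
its scan limit scaled down to 60 * 0.1 / 3 = 2 CPUs.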

Tested on a system with 240 CPUs across 2 sockets. With SNC
(sub-NUMA cluster) enabled, each LLC domain has 60 CPUs. There is a
noticeable improvement in netperf. (With SNC disabled, more improvement
should be seen, because the C2C latency is higher.)

The global slice is 3 msec (sysctl_sched_base_slice) by default on my
Ubuntu 22.04, and the customized slice is set to 0.1 msec for both
netperf and netserver:

for i in $(seq 1 $job); do
	netperf_slice -e 100000 -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 &
done
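
(netperf_slice is a wrapper around netperf; judging by the 0.1 msec
figure above, its -e option presumably takes the slice in nsec. Below
is a minimal sketch of what such a wrapper could do, assuming the EEVDF
request/slice interface where sched_attr::sched_runtime carries the
custom slice in nanoseconds for a SCHED_OTHER task; this is not the
author's actual tool:)

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* userspace mirror of the uapi sched_attr layout */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* requested slice, nsec */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= 0,		/* SCHED_OTHER */
		.sched_runtime	= 100000,	/* 0.1 msec custom slice */
	};
	char *args[] = { "netperf", "-4", "-H", "127.0.0.1", "-t", "TCP_RR",
			 "-c", "-C", "-l", "100", NULL };

	/* set the custom slice on ourselves... */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		return 1;

	/* ...then exec netperf; scheduling attributes survive execve() */
	execvp(args[0], args);
	return 1;
}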

case                    load            baseline(std%)  compare%( std%)
TCP_RR                  60-threads       1.00 (  1.60)   +0.35 (  1.73)
TCP_RR                  120-threads      1.00 (  1.34)   -0.96 (  1.24)
TCP_RR                  180-threads      1.00 (  1.59)  +92.20 (  4.24)
TCP_RR                  240-threads      1.00 (  9.71)  +43.11 (  2.97)

Suggested-by: Tim Chen <tim.c.chen@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
kernel/sched/fair.c | 23 ++++++++++++++++++++---
kernel/sched/features.h | 1 +
2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edc23f6588a3..f269ae7d6e24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7368,6 +7368,24 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ * Scale the scan number of idle CPUs according to the customized
+ * wakee's slice. The smaller the slice is, the earlier the task
+ * wants to be picked up, thus the lower wakeup latency the task
+ * expects. The baseline is the global sysctl_sched_base_slice. A
+ * task slice smaller than the global one shrinks the scan number.
+ */
+static int adjust_idle_scan(struct task_struct *p, int nr)
+{
+	if (!sched_feat(SIS_FAST))
+		return nr;
+
+	if (!p->se.custom_slice || p->se.slice >= sysctl_sched_base_slice)
+		return nr;
+
+	return div_u64(nr * p->se.slice, sysctl_sched_base_slice);
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
-			/* because !--nr is the condition to stop scan */
-			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
 			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
+			if (nr <= 0)
 				return -1;
 		}
 	}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..176324236018 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FAST, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1