Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
From: Mathieu Desnoyers
Date: Mon Sep 11 2023 - 16:49:39 EST
On 9/11/23 11:26, Mathieu Desnoyers wrote:
On 9/10/23 22:50, Chen Yu wrote:
[...]
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
struct task_struct *p, int flags)
hrtick_update(rq);
now = sched_clock_cpu(cpu_of(rq));
p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+ /*
+ * If this rq will become idle, and dequeued task is
+ * a short sleeping one, check if we can reserve
+ * this idle CPU for that task for a short while.
+ * During this reservation period, other wakees will
+ * skip this 'idle' CPU in select_idle_cpu(), and this
+ * short sleeping task can pick its previous CPU in
+ * select_idle_sibling(), which brings better cache
+ * locality.
+ */
+ if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+ p->se.sleep_avg && p->se.sleep_avg <
sysctl_sched_migration_cost)
+ rq->cache_hot_timeout = now + p->se.sleep_avg;
This is really cool!
There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have short blocked periods.
Those bursts will likely bring the average to values too high
to stay below sysctl_sched_migration_cost.
I wonder if changing the code above for the following would help ?
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
p->se.sleep_avg)
rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
p->se.sleep_avg);
For tasks that have a large sleep_avg, it would activate this rq
"appear as not idle for rq selection" scheme for a window of
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.
There is perhaps a better way to handle bursts:
When calculating the sleep_avg, we actually only really care about
the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
to select which of the sleeps should be taken into account in the avg.
We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
u64 now = sched_clock_cpu(task_cpu(p));
if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
update_avg(&p->se.burst_sleep_avg, now - last_sleep);
Then we can have this code is dequeue_task_fair:
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg)
rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com