[PATCH 5/5] sched/fair: Add exception for hints in load balancing path

From: K Prateek Nayak
Date: Sat Sep 10 2022 - 06:56:04 EST


- Load balancing considerations

If the MC domain has more runnable tasks than CPUs, ignore the hint
set by the user. Otherwise, honor the hint during load balancing so
that the consolidation done at wakeup time is not lost.

- Considerations

A few rounds of trial and error were done to find a good threshold for
ignoring hints. The following are some of the wins and woes:

o Ignore hint if the MC domain of the src CPU does not have an idle
core: This metric is not very accurate and led to losing
consolidation early on.
o Ignore hint if sd_shared->nr_llc_scan is 0: This too, like the
has_idle_core metric, was not always accurate.
o An atomic read of sd_shared->nr_busy_cpus doesn't account for
overloaded run queues.

Best results were found by scanning the LLC, counting the running
tasks, and comparing that count with the size of the LLC. If the LLC is
beyond fully loaded, i.e. it has more running tasks than CPUs, the hint
can safely be ignored.
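
The check described above can be modelled in user space roughly as
follows. This is only a sketch: struct rq_model and llc_ignore_hint()
are hypothetical stand-ins for illustration, not kernel code, and the
array of run queues stands in for the CPUs spanned by the LLC domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's per-CPU struct rq. */
struct rq_model {
	unsigned int nr_running;        /* runnable tasks on this CPU */
	unsigned int idle_h_nr_running; /* subset counted as idle-policy tasks */
};

/*
 * Return true if hints should be ignored, i.e. the LLC has more
 * running (non-idle-policy) tasks than it has CPUs.
 */
bool llc_ignore_hint(const struct rq_model *rqs, size_t llc_size)
{
	unsigned int nr_llc_running = 0;
	size_t cpu;

	assert(rqs != NULL || llc_size == 0);

	for (cpu = 0; cpu < llc_size; cpu++)
		nr_llc_running += rqs[cpu].nr_running - rqs[cpu].idle_h_nr_running;

	/* One task per CPU (or fewer) still fits: keep following hints. */
	return nr_llc_running > llc_size;
}
```

With four CPUs, four running tasks keep the hint in effect, while a
fifth running task makes the function return true.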

- Possible Improvements

o Consider the status of the hint: If a wake affine hint was ignored
in the wakeup path, consider ignoring it in the load balancer path
as well, since the LLC the task is running on is then not the
desired LLC.
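
One way this improvement might look, as a rough user-space model: the
hint_ignored_at_wakeup field, struct task_model and the HINT_* bit
values below are all hypothetical placeholders (the real
PR_SCHED_HINT_* values are defined elsewhere in this series, and no
such status field exists in this patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values, NOT the real PR_SCHED_HINT_* definitions. */
#define HINT_WAKE_AFFINE	(1u << 0)
#define HINT_WAKE_HOLD		(1u << 1)

/* Hypothetical, simplified stand-in for struct task_struct. */
struct task_model {
	unsigned int hint;           /* user-provided hint bits */
	bool hint_ignored_at_wakeup; /* hypothetical status flag */
};

/* Return true if the hint should stop the load balancer from moving p. */
bool hint_blocks_migration(const struct task_model *p, bool ignore_hint)
{
	assert(p != NULL);

	/* LLC is overloaded: the hint is not followed. */
	if (ignore_hint)
		return false;

	/* Wakeup path already gave up on the hint: do the same here. */
	if (p->hint_ignored_at_wakeup)
		return false;

	return (p->hint & (HINT_WAKE_AFFINE | HINT_WAKE_HOLD)) != 0;
}
```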

Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c61bd0e93b3..8e1679b784fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7810,6 +7810,9 @@ struct lb_env {
unsigned int loop_break;
unsigned int loop_max;

+ /* Indicator to ignore hint if LLC is overloaded */
+ int ignore_hint;
+
enum fbq_type fbq_type;
enum migration_type migration_type;
struct list_head tasks;
@@ -7977,6 +7980,21 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+ /*
+ * Hints are followed only if the MC Domain is still ideal
+ * for the task.
+ */
+ if (!env->ignore_hint) {
+ /*
+ * Only consider the hints from the wakeup path to maintain
+ * data locality.
+ */
+ if (READ_ONCE(p->hint) &
+ (PR_SCHED_HINT_WAKE_AFFINE | PR_SCHED_HINT_WAKE_HOLD))
+ return 0;
+ }
+
+
/* Record that we found at least one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

@@ -10182,6 +10200,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.cpus = cpus,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
+ .ignore_hint = 1,
};

cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
@@ -10213,6 +10232,30 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.src_cpu = busiest->cpu;
env.src_rq = busiest;

+ /*
+ * Check if the hints can be followed during
+ * this load balancing cycle.
+ */
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+ struct sched_domain *src_sd_llc = rcu_dereference(per_cpu(sd_llc, env.src_cpu));
+
+ if (src_sd_llc) {
+ int cpu, nr_llc_running = 0, llc_size = per_cpu(sd_llc_size, env.src_cpu);
+
+ for_each_cpu_wrap(cpu, sched_domain_span(src_sd_llc), env.src_cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ nr_llc_running += rq->nr_running - rq->cfs.idle_h_nr_running;
+ }
+
+ /*
+ * Don't ignore hint if we can have one task
+ * per CPU in the LLC of the src_cpu.
+ */
+ if (nr_llc_running <= llc_size)
+ env.ignore_hint = 0;
+ }
+ }
+
ld_moved = 0;
/* Clear this flag as soon as we find a pullable task */
env.flags |= LBF_ALL_PINNED;
@@ -10520,6 +10563,7 @@ static int active_load_balance_cpu_stop(void *data)
.src_rq = busiest_rq,
.idle = CPU_IDLE,
.flags = LBF_ACTIVE_LB,
+ .ignore_hint = sd->flags & SD_SHARE_PKG_RESOURCES,
};

schedstat_inc(sd->alb_count);
--
2.25.1