From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Hi all,

One of the many things on the eternal todo list has been finishing the
below hackery.

It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply
to clusters (L2); specifically, any case of multiple cache domains
inside a node.

Anyway, I wrote this about a year ago, and I mentioned it at the recent
OSPM conference, where Gautham and Prateek expressed interest in playing
with this code.

So here goes; very rough and largely unproven code ahead :-)

It applies to current tip/master, but I know it will fail the __percpu
validation that sits in -next, although that shouldn't be terribly hard
to fix up.

As is, it only computes the CPU inside each LLC that has the highest
recent runtime; this CPU is then used in the wake-up path to steer
wake-ups towards that LLC, and in task_hot() to limit migrations away
from it.

More elaborate things could be done; notably, there is an XXX in there
somewhere about finding the best LLC inside a NODE (interaction with
NUMA_BALANCING).

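For that XXX, one obvious direction (again just a sketch, not in the
patch; mm_update_scan_mask() is a made-up helper) would be to restrict
the LLC scan to the node that NUMA_BALANCING already prefers:

  /* Sketch: only consider LLCs on the task's NUMA-preferred node. */
  static void mm_update_scan_mask(struct task_struct *p, struct cpumask *cpus)
  {
          cpumask_copy(cpus, cpu_online_mask);

  #ifdef CONFIG_NUMA_BALANCING
          if (p->numa_preferred_nid != NUMA_NO_NODE)
                  cpumask_and(cpus, cpus,
                              cpumask_of_node(p->numa_preferred_nid));
  #endif
  }
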
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
 include/linux/mm_types.h |  44 ++++++
 include/linux/sched.h    |   4 +
 init/Kconfig             |   4 +
 kernel/fork.c            |   5 +
 kernel/sched/core.c      |  13 +-
 kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |   8 +
 7 files changed, 388 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..013291c6aaa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -893,6 +893,12 @@ struct mm_cid {
 };
 #endif
+static void task_cache_work(struct callback_head *work)
+{
+        struct task_struct *p = current;
+        struct mm_struct *mm = p->mm;
+        unsigned long m_a_occ = 0;
+        int cpu, m_a_cpu = -1;
+        cpumask_var_t cpus;
+
+        WARN_ON_ONCE(work != &p->cache_work);
+
+        work->next = work;
+
+        if (p->flags & PF_EXITING)
+                return;
+
+        if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+                return;
+
+        scoped_guard (cpus_read_lock) {
+                cpumask_copy(cpus, cpu_online_mask);
+
+                for_each_cpu(cpu, cpus) {
+                        /* XXX sched_cluster_active */
+                        struct sched_domain *sd = per_cpu(sd_llc, cpu);
+                        unsigned long occ, m_occ = 0, a_occ = 0;
+                        int m_cpu = -1, nr = 0, i;
+
+                        for_each_cpu(i, sched_domain_span(sd)) {
+                                occ = fraction_mm_sched(cpu_rq(i),
+                                                        per_cpu_ptr(mm->pcpu_sched, i));
+                                a_occ += occ;
+                                if (occ > m_occ) {
+                                        m_occ = occ;
+                                        m_cpu = i;
+                                }
+                                nr++;
+                                trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+                                             per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+                        }
+
+                        a_occ /= nr;
+                        if (a_occ > m_a_occ) {
+                                m_a_occ = a_occ;
+                                m_a_cpu = m_cpu;
+                        }
+
+                        trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+                                     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+                        for_each_cpu(i, sched_domain_span(sd)) {
+                                /* XXX threshold ? */
+                                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+                        }
+
+                        cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+                }
+        }
+
+        /*
+         * If the max average cache occupancy is 'small' we don't care.
+         */
+        if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+                m_a_cpu = -1;
+
+        mm->mm_sched_cpu = m_a_cpu;
+
+        free_cpumask_var(cpus);
+}
+