[PATCH 6/6] sched/numa: Limit the conditions where scan period is reset

From: Srikar Dronamraju
Date: Fri Aug 03 2018 - 02:15:05 EST


From: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
cross-node migration. When saturation causes excessive load balancing,
this can peg the scan rate at its maximum and overload the machine
further.

This patch resets the scan period only if NUMA balancing is active, a
preferred node has been selected and the task is being migrated away from
the preferred node, as these migrations are the most harmful. For example, a
migration to the preferred node does not justify a faster scan rate.
Similarly, a migration between two nodes that are not preferred is probably
the task bouncing due to over-saturation of the machine. In that case,
scanning faster and trapping more NUMA faults will further overload the
machine. Resets are still allowed before the first scan has completed, as
such a task is most likely new and was pulled cross-node by wakeups or load
balancing.
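
The conditions described above can be sketched as a standalone predicate.
This is an illustration only, not the kernel code: struct numa_task_sketch
and should_reset_scan_period() are hypothetical names, the boolean fields
stand in for the sched_numa_balancing static branch, p->mm and PF_EXITING,
and any enclosing condition in migrate_task_rq_fair() that the diff does
not show is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the task state the patch inspects. */
struct numa_task_sketch {
	bool numa_balancing_on;  /* stands in for sched_numa_balancing */
	bool has_mm;             /* stands in for p->mm != NULL */
	bool exiting;            /* stands in for p->flags & PF_EXITING */
	int numa_scan_seq;       /* 0 until one scan has completed */
	int numa_preferred_nid;  /* -1 when no preferred node is set */
};

/* Returns true when the scan period would be reset for this migration. */
static bool should_reset_scan_period(const struct numa_task_sketch *t,
				     int src_nid, int dst_nid)
{
	if (!t->numa_balancing_on)
		return false;

	if (!t->has_mm || t->exiting)
		return false;

	/* Same-node migrations never reset the scan period. */
	if (src_nid == dst_nid)
		return false;

	/* Before the first scan completes, a reset is always allowed. */
	if (t->numa_scan_seq) {
		/*
		 * Skip the reset when moving to the preferred node, or when
		 * a preferred node exists and the task was not running on it.
		 */
		if (dst_nid == t->numa_preferred_nid ||
		    (t->numa_preferred_nid != -1 &&
		     src_nid != t->numa_preferred_nid))
			return false;
	}

	return true;
}
```

With a preferred node of 0 and a completed scan, moving 0 -> 1 resets the
period, while 1 -> 0 and 1 -> 2 do not.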

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev     Current  %Change
4     208862   209029   0.0799571
1     307007   326585   6.37705


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89911.4  89627.8  -0.315422
1     216176   221299   2.36983


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev     Current  %Change
4     196078   195444   -0.323341
1     214664   222390   3.59911


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     60719.2  60152.4  -0.933477
1     112615   111458   -1.02739


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12511.7  12559.4  12539.5  15.5883
5      12904.6  12969    12942.6  23.9053   3.21464


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4709.28  4979.28  4919.32  105.126
5      4984.25  5025.95  5004.5   14.2253   1.73154


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9388.38  9406.29  9395.1   5.98959
5      9277.64  9357.22  9322.07  26.3558   -0.77732


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      157.71   184.929  174.754  10.7275
5      160.632  175.558  168.655  5.26823   -3.49005
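
The %Change columns above are consistent with a plain relative difference,
taken between Prev and Current for specjbb2005 and between the Avg values
of the first and second rows for dbench. A minimal sketch, where
pct_change() is a hypothetical helper and not part of the benchmark
tooling:

```c
/* Relative change of current versus prev, in percent. */
static double pct_change(double prev, double current)
{
	return (current - prev) / prev * 100.0;
}
```

For instance, pct_change(208862, 209029) reproduces the 4-JVM Intel row to
within rounding, and pct_change(174.754, 168.655) reproduces the Power7
dbench row.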


Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
---
kernel/sched/fair.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ea0eff..6e251e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6357,6 +6357,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 	p->se.exec_start = 0;

 #ifdef CONFIG_NUMA_BALANCING
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
 	if (!p->mm || (p->flags & PF_EXITING))
 		return;

@@ -6364,8 +6367,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 		int src_nid = cpu_to_node(task_cpu(p));
 		int dst_nid = cpu_to_node(new_cpu);

-		if (src_nid != dst_nid)
-			p->numa_scan_period = task_scan_start(p);
+		if (src_nid == dst_nid)
+			return;
+
+		/*
+		 * Allow resets if faults have been trapped before one scan
+		 * has completed. This is most likely due to a new task that
+		 * is pulled cross-node due to wakeups or load balancing.
+		 */
+		if (p->numa_scan_seq) {
+			/*
+			 * Avoid scan adjustments if moving to the preferred
+			 * node or if the task was not previously running on
+			 * the preferred node.
+			 */
+			if (dst_nid == p->numa_preferred_nid ||
+			    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+				return;
+		}
+
+		p->numa_scan_period = task_scan_start(p);
 	}
 #endif
 }
--
1.8.3.1