[RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

From: Lauro Ramos Venancio
Date: Thu Apr 13 2017 - 09:57:09 EST


Currently, on a 4-node NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
scheduler is unable to directly move tasks between these nodes. In the
worst case, when a set of tasks is bound to nodes 1 and 3, performance
is severely impacted because only one node is used while the other
remains idle.
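
The overlap described above can be reproduced with a small user-space
model. This is only a sketch, not kernel code: it assumes one bit per
NUMA node, a ring 0-1-2-3-0, and that each group is the 1-hop span of
the first node not yet covered. NR_NODES, ring_child_span(),
old_build_groups() and shared_nodes() are hypothetical names used for
illustration.

```c
#define NR_NODES 4	/* assumption: 4-node ring 0-1-2-3-0 */

/* Hypothetical helper: nodes within 1 hop of @node, as a bitmask. */
static unsigned int ring_child_span(int node)
{
	int prev = (node + NR_NODES - 1) % NR_NODES;
	int next = (node + 1) % NR_NODES;

	return (1u << prev) | (1u << node) | (1u << next);
}

/*
 * Model of the current construction: walk the nodes from 0 and emit
 * the 1-hop span of every node that is not covered yet. Returns the
 * number of groups stored in @groups.
 */
static int old_build_groups(unsigned int groups[NR_NODES])
{
	unsigned int covered = 0;
	int i, nr = 0;

	for (i = 0; i < NR_NODES; i++) {
		if (covered & (1u << i))
			continue;
		groups[nr] = ring_child_span(i);
		covered |= groups[nr];
		nr++;
	}
	return nr;
}

/* Nodes that appear in both of the first two groups. */
static unsigned int shared_nodes(const unsigned int groups[NR_NODES])
{
	return groups[0] & groups[1];
}
```

In this model old_build_groups() yields the two groups (3, 0, 1) and
(1, 2, 3), and shared_nodes() reports nodes 1 and 3, which is the pair
the scheduler cannot balance between directly.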

This problem also affects machines with more NUMA nodes. For instance,
the scheduler is currently unable to directly move tasks between some
pairs of nodes 2 hops apart on an 8-node machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU's perspective. So,
on a 4-node machine with ring topology, while nodes 0 and 2 keep the
same groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new
groups [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any two
nodes 2 hops apart.
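
The per-CPU construction can be sketched in user space as follows. This
is an illustrative model under stated assumptions, not the kernel
implementation: one bit per NUMA node, a ring 0-1-2-3-0, and groups
equal to 1-hop spans. NR_NODES, ring_child_span() and
build_groups_for() are hypothetical names.

```c
#define NR_NODES 4	/* assumption: 4-node ring 0-1-2-3-0 */

/* Hypothetical helper: nodes within 1 hop of @node, as a bitmask. */
static unsigned int ring_child_span(int node)
{
	int prev = (node + NR_NODES - 1) % NR_NODES;
	int next = (node + 1) % NR_NODES;

	return (1u << prev) | (1u << node) | (1u << next);
}

/*
 * Model of the proposed construction for the domain of @cpu_node: the
 * first group is always the 1-hop span of @cpu_node itself, and the
 * remaining groups are built from the nodes still uncovered. Returns
 * the number of groups stored in @groups.
 */
static int build_groups_for(int cpu_node, unsigned int groups[NR_NODES])
{
	unsigned int covered;
	int i, nr = 0;

	groups[nr] = ring_child_span(cpu_node);
	covered = groups[nr];
	nr++;

	for (i = 0; i < NR_NODES; i++) {
		if (covered & (1u << i))
			continue;
		groups[nr] = ring_child_span(i);
		covered |= groups[nr];
		nr++;
	}
	return nr;
}
```

Calling build_groups_for(1, groups) in this model gives (0, 1, 2) and
(2, 3, 0), so nodes 1 and 3 end up in different groups, while
build_groups_for(0, groups) still gives (3, 0, 1) and (1, 2, 3).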

SPECjbb2005 results on an 8-node NUMA machine with mesh topology:

Threads        before               after           %
            mean   stddev       mean   stddev
   1       22801     1950      27059     1367     +19%
   8      146008    50782     209193      826     +43%
  32      351030   105111     522445     9051     +49%
  48      365835   116571     594905     3314     +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@xxxxxxxxxx>
---
kernel/sched/topology.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;

-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));

 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,

 		init_overlap_sched_group(sd, sg, i);

-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;

 	return 0;

 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);

 	return -ENOMEM;
 }
--
1.8.3.1