Re: [PATCH 0/4] sched/topology: fix overlap group capacity and balance cpu

From: Peter Zijlstra
Date: Wed Apr 26 2017 - 18:44:15 EST


On Wed, Apr 26, 2017 at 02:59:09PM -0300, Lauro Venancio wrote:
> On 04/26/2017 01:31 PM, Peter Zijlstra wrote:
> >
> > Please have a look here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip
> >
> > I think that has them all.
> Yes. I agree.

FWIW, this is how far I got with writing comments...

I'll try and finish on Friday (holiday tomorrow).

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9e276f5a83a4..e6ba3f0793ff 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -494,6 +494,68 @@ enum s_alloc {
};

/*
+ * Return the canonical balance CPU for this group, this is the first CPU
+ * of this group that's also in the iteration mask.
+ */
+int group_balance_cpu(struct sched_group *sg)
+{
+ return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+}
+
+
+/*
+ * NUMA topology (first read the regular topology blurb below)
+ *
+ * Given a node-distance table, for example:
+ *
+ * node 0 1 2 3
+ * 0: 10 20 30 20
+ * 1: 20 10 20 30
+ * 2: 30 20 10 20
+ * 3: 20 30 20 10
+ *
+ * which represents a 4 node ring topology like:
+ *
+ * 0 ----- 1
+ * | |
+ * | |
+ * | |
+ * 3 ----- 2
+ *
+ * We want to construct domains and groups to represent this. The way we go about
+ * doing this is to build the domains on 'hops'. For each NUMA level we construct
+ * the mask of all nodes reachable in @level hops.
+ *
+ * For the above NUMA topology that gives 3 levels:
+ *
+ * NUMA-2 0-3 0-3 0-3 0-3
+ *
+ * NUMA-1 0-1,3 0-2 1-3 0,2-3
+ *
+ * NUMA-0 0 1 2 3
+ *
+ *
+ * As can be seen; things don't nicely line up as with the regular topology.
+ * When we iterate a domain in child domain chunks some nodes can be
+ * represented multiple times -- hence the "overlap" naming for this part of
+ * the topology.
+ *
+ * In order to minimize this overlap, we only build enough groups to cover the
+ * domain. For instance NODE-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
+ *
+ * Because:
+ *
+ * - the first group of each domain is its child domain; this
+ * gets us the first 0-1,3
+ * - the only uncovered node is 2, who's child domain is 1-3.
+ *
+ * XXX words on sched_group_capacity and balance_cpu
+ *
+ */
+
+
+
+/*
* Build an iteration mask that can exclude certain CPUs from the upwards
* domain traversal.
*
@@ -532,15 +594,6 @@ build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask
WARN_ON_ONCE(cpumask_empty(mask));
}

-/*
- * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
- */
-int group_balance_cpu(struct sched_group *sg)
-{
- return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
-}
-
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
@@ -646,6 +699,59 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
return -ENOMEM;
}

+
+/*
+ * Package topology (also see the load-balance blurb in fair.c)
+ *
+ * The scheduler builds a tree structure to represent a number of important
+ * topology features. By default (default_topology[]) these include:
+ *
+ * - Simultaneous multithreading (SMT)
+ * - Multi-Core Cache (MC)
+ * - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ * sched_domain -> sched_group -> sched_group_capacity
+ *
+ * The sched_domains are per-cpu and have a two way link (parent & child) and
+ * denote the ever growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular (double) linked list of sched_group's, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
+ *
+ * CPU 0 1 2 3 4 5 6 7
+ *
+ * DIE [ ]
+ * MC [ ] [ ]
+ * SMT [ ] [ ] [ ] [ ]
+ *
+ * - or -
+ *
+ * DIE 0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC 0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT 0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU 0 1 2 3 4 5 6 7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * XXX words on how these levels don't overlap
+ * XXX words on balance CPU
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
{
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);