Re: hotplug support for arch/arc/plat-eznps platform

From: Peter Zijlstra
Date: Tue Aug 08 2017 - 06:16:36 EST


On Tue, Aug 08, 2017 at 06:49:39AM +0000, Ofer Levi(SW) wrote:

> The idea behind implementing hotplug for this arch is to shorten the time
> to traffic processing. This way, instead of waiting ~5 min for all
> cpus to boot, the application running on cpu 0 will loop booting the other
> cpus and assigning the traffic processing application to each of them.
> Outgoing traffic will build up until all cpus are up and running at full
> traffic rate. This method allows traffic processing to start after
> ~20 sec instead of 5 min.

Ah, ok. So only online is ever used. Offline is a whole other can of
worms.
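
The flow described above would be something like this (a rough, untested
sketch; the CPU count and the "traffic_app" name are placeholders, not
taken from the actual platform code):

$ for ((i=1;i<4096;i++))
  do
	echo 1 > /sys/devices/system/cpu/cpu$i/online
	taskset -c $i ./traffic_app &
  done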

> > So how can boot be different from hot-plugging them?
>
> Please have a look at the following code in kernel/sched/core.c, sched_cpu_activate():
>
> if (sched_smp_initialized) {
> 	sched_domains_numa_masks_set(cpu);
> 	cpuset_cpu_active();
> }

Ah, cute, I totally missed we did that. Yes that avoids endless domain
rebuilds on boot.

> The cpuset_cpu_active call eventually leads to the function in
> question, partition_sched_domains(). When cold-booting cpus, the
> sched_smp_initialized flag is false and therefore
> partition_sched_domains() is not executed.

So you're booting with "maxcpus=1" to only online the one. And then you
want to online the rest once userspace runs.

There are two possibilities. The one I prefer (but which appears the most
broken with the current code) is using the cpuset controller.

1)

Once you're up and running with a single CPU do:

$ mkdir /cgroup
$ mount none /cgroup -t cgroup -o cpuset
$ echo 0 > /cgroup/cpuset.sched_load_balance
$ for ((i=1;i<4096;i++))
  do
	echo 1 > /sys/devices/system/cpu/cpu$i/online
  done

And then, if you want load-balancing, you can re-enable it globally,
or only on a subset of CPUs.
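
For example (an untested sketch; the child cpuset name "part0" and the
CPU range are arbitrary), balancing can be re-enabled for just a subset
by giving those CPUs a child cpuset of their own:

$ mkdir /cgroup/part0
$ echo 1-8 > /cgroup/part0/cpuset.cpus
$ echo 0 > /cgroup/part0/cpuset.mems
$ echo 1 > /cgroup/part0/cpuset.sched_load_balance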


2)

The alternative is to use "isolcpus=1-4095" to completely kill
load-balancing. This more or less works with the current code,
except that it will keep rebuilding the CPU0 sched-domain, which
is somewhat pointless (also fixed by the below patch).

The reason I don't particularly like this option is that it's boot-time
only; you cannot reconfigure your system at runtime, but that might
be good enough for you.
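
Concretely (an untested sketch; the application name is a placeholder),
that means booting with something like "maxcpus=1 isolcpus=1-4095" and
then, after onlining each CPU, placing the application on it explicitly:

$ taskset -c 1 ./traffic_app &

since isolated CPUs are excluded from load-balancing and will not pick
up work on their own.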


With the attached patch, either option generates (I only have 40 CPUs):

[ 44.305563] CPU0 attaching NULL sched-domain.
[ 51.954872] SMP alternatives: switching to SMP code
[ 51.976923] x86: Booting SMP configuration:
[ 51.981602] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 52.057756] microcode: sig=0x306e4, pf=0x1, revision=0x416
[ 52.064740] microcode: updated to revision 0x428, date = 2014-05-29
[ 52.080854] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 52.164124] smpboot: Booting Node 0 Processor 3 APIC 0x6
[ 52.244615] smpboot: Booting Node 0 Processor 4 APIC 0x8
[ 52.324564] smpboot: Booting Node 0 Processor 5 APIC 0x10
[ 52.405407] smpboot: Booting Node 0 Processor 6 APIC 0x12
[ 52.485460] smpboot: Booting Node 0 Processor 7 APIC 0x14
[ 52.565333] smpboot: Booting Node 0 Processor 8 APIC 0x16
[ 52.645364] smpboot: Booting Node 0 Processor 9 APIC 0x18
[ 52.725314] smpboot: Booting Node 1 Processor 10 APIC 0x20
[ 52.827517] smpboot: Booting Node 1 Processor 11 APIC 0x22
[ 52.912271] smpboot: Booting Node 1 Processor 12 APIC 0x24
[ 52.996101] smpboot: Booting Node 1 Processor 13 APIC 0x26
[ 53.081239] smpboot: Booting Node 1 Processor 14 APIC 0x28
[ 53.164990] smpboot: Booting Node 1 Processor 15 APIC 0x30
[ 53.250146] smpboot: Booting Node 1 Processor 16 APIC 0x32
[ 53.333894] smpboot: Booting Node 1 Processor 17 APIC 0x34
[ 53.419026] smpboot: Booting Node 1 Processor 18 APIC 0x36
[ 53.502820] smpboot: Booting Node 1 Processor 19 APIC 0x38
[ 53.587938] smpboot: Booting Node 0 Processor 20 APIC 0x1
[ 53.659828] microcode: sig=0x306e4, pf=0x1, revision=0x428
[ 53.674857] smpboot: Booting Node 0 Processor 21 APIC 0x3
[ 53.756346] smpboot: Booting Node 0 Processor 22 APIC 0x5
[ 53.836793] smpboot: Booting Node 0 Processor 23 APIC 0x7
[ 53.917753] smpboot: Booting Node 0 Processor 24 APIC 0x9
[ 53.998717] smpboot: Booting Node 0 Processor 25 APIC 0x11
[ 54.079674] smpboot: Booting Node 0 Processor 26 APIC 0x13
[ 54.160636] smpboot: Booting Node 0 Processor 27 APIC 0x15
[ 54.241592] smpboot: Booting Node 0 Processor 28 APIC 0x17
[ 54.322553] smpboot: Booting Node 0 Processor 29 APIC 0x19
[ 54.403487] smpboot: Booting Node 1 Processor 30 APIC 0x21
[ 54.487676] smpboot: Booting Node 1 Processor 31 APIC 0x23
[ 54.571921] smpboot: Booting Node 1 Processor 32 APIC 0x25
[ 54.656508] smpboot: Booting Node 1 Processor 33 APIC 0x27
[ 54.740835] smpboot: Booting Node 1 Processor 34 APIC 0x29
[ 54.824466] smpboot: Booting Node 1 Processor 35 APIC 0x31
[ 54.908374] smpboot: Booting Node 1 Processor 36 APIC 0x33
[ 54.992322] smpboot: Booting Node 1 Processor 37 APIC 0x35
[ 55.076333] smpboot: Booting Node 1 Processor 38 APIC 0x37
[ 55.160249] smpboot: Booting Node 1 Processor 39 APIC 0x39


---
Subject: sched,cpuset: Avoid spurious/wrong domain rebuilds

When disabling cpuset.sched_load_balance we expect to be able to online
CPUs without generating sched_domains. However, this is currently
completely broken.

What happens is that we generate the sched_domains and then destroy
them. This is because of the spurious 'default' domain build in
cpuset_update_active_cpus(). That builds a single machine-wide domain
and then schedules a work item to build the 'real' domains. The work
item then finds there are _no_ domains and destroys the lot again.

Furthermore, if there actually were cpusets, building the machine-wide
domain is actively wrong, because it would allow tasks to 'escape' their
cpuset. Also, I don't think it's needed; the scheduler really should
respect the active mask.

Also (this should probably be a separate patch), fix
partition_sched_domains() to try to preserve the existing machine-wide
domain instead of unconditionally destroying it. We do this by
attempting to allocate the new single domain; only when that fails do we
reuse the fallback_doms.

Cc: Tejun Heo <tj@xxxxxxxxxx>
Almost-Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
 kernel/cgroup/cpuset.c  |  6 ------
 kernel/sched/topology.c | 15 ++++++++++++---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e5008c..e557cdba2350 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2342,13 +2342,7 @@ void cpuset_update_active_cpus(void)
 	 * We're inside cpu hotplug critical region which usually nests
 	 * inside cgroup synchronization. Bounce actual hotplug processing
 	 * to a work item to avoid reverse locking order.
-	 *
-	 * We still need to do partition_sched_domains() synchronously;
-	 * otherwise, the scheduler will get confused and put tasks to the
-	 * dead CPU. Fall back to the default single domain.
-	 * cpuset_hotplug_workfn() will rebuild it as necessary.
 	 */
-	partition_sched_domains(1, NULL, NULL);
 	schedule_work(&cpuset_hotplug_work);
 }

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895aec281e..1b74b2cc5dba 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1854,7 +1854,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	/* Let the architecture update CPU core mappings: */
 	new_topology = arch_update_cpu_topology();
 
-	n = doms_new ? ndoms_new : 0;
+	if (!doms_new) {
+		WARN_ON_ONCE(dattr_new);
+		n = 0;
+		doms_new = alloc_sched_domains(1);
+		if (doms_new) {
+			n = 1;
+			cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
+		}
+	} else {
+		n = ndoms_new;
+	}
 
 	/* Destroy deleted domains: */
 	for (i = 0; i < ndoms_cur; i++) {
@@ -1870,11 +1880,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	}
 
 	n = ndoms_cur;
-	if (doms_new == NULL) {
+	if (!doms_new) {
 		n = 0;
 		doms_new = &fallback_doms;
 		cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
-		WARN_ON_ONCE(dattr_new);
 	}
 
 	/* Build new domains: */