Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

From: David Woodhouse
Date: Tue Jan 19 2021 - 07:27:48 EST


On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> Since the rewrite of the CPU hotplug infrastructure to a state machine
> it's pretty obvious that the bringup of APs can be changed from the fully
> serialized:
>
>     for_each_present_cpu(cpu) {
>             if (!cpu_online(cpu))
>                     cpu_up(cpu, CPUHP_ONLINE);
>     }
>
> to
>
>     for_each_present_cpu(cpu) {
>             if (!cpu_online(cpu))
>                     cpu_up(cpu, CPUHP_KICK_CPU);
>     }
>
>     for_each_present_cpu(cpu) {
>             if (!cpu_active(cpu))
>                     cpu_up(cpu, CPUHP_ONLINE);
>     }
>
> The CPUHP_KICK_CPU state does not exist today, but it's just the logical
> consequence of the state machine. It's basically splitting __cpu_up()
> into:
>
>     __cpu_kick()
>     {
>             prepare();
>             arch_kick_remote_cpu();    -> Send IPI/NMI, Firmware call .....
>     }
>
>     __cpu_wait_online()
>     {
>             wait_until_cpu_online();
>             do_further_stuff();
>     }
>
> There is some more to it than just blindly splitting it up at the
> architecture level.

We've been playing with this a little. There's a proof-of-concept hack
below; don't look too hard because it's only really for figuring out
the timing etc.

Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
a separate function, do_wait_cpu(). There are four phases to the wait
(condensed in the sketch after this list):

• Wait for the AP to turn up in cpu_initialized_mask, then set its bit in
cpu_callout_mask to allow it to run the AP thread.
• Wait for it to finish init and show up in cpu_callin_mask.
• check_tsc_sync_source()
• Wait for cpu_online(cpu)
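
Condensed, the wait side looks roughly like this (it's essentially what
do_wait_cpu() in the hack below boils down to, with error handling and
the 10s timeout on the first phase elided):

int do_wait_cpu(unsigned int cpu)
{
	unsigned long flags;
	cycles_t t1 = get_cycles(), t2, t3, t4, t5;

	/* 1: wait for the AP to show up in cpu_initialized_mask, then set
	 *    its bit in cpu_callout_mask so it can run the AP thread */
	while (!cpumask_test_cpu(cpu, cpu_initialized_mask))
		schedule();
	cpumask_set_cpu(cpu, cpu_callout_mask);
	t2 = get_cycles();

	/* 2: wait for the AP to finish init and call in */
	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
		schedule();
	t3 = get_cycles();

	/* 3: TSC synchronization check, with irqs disabled */
	local_irq_save(flags);
	check_tsc_sync_source(cpu);
	local_irq_restore(flags);
	t4 = get_cycles();

	/* 4: wait for the AP to be marked online */
	while (!cpu_online(cpu)) {
		cpu_relax();
		touch_nmi_watchdog();
	}
	t5 = get_cycles();

	printk("CPU#%d up in %lld,%lld,%lld,%lld (%lld)\n", cpu,
	       t2 - t1, t3 - t2, t4 - t3, t5 - t4, t5 - t1);
	return 0;
}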

There's an EARLY_INIT macro which controls whether the early bringup
call actually *does* anything, or whether it's left until bringup_cpu()
as the current code does. It allows a simple comparison of the two.
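
For reference, the switch boils down to this (condensed from the hack
below; do_cpu_up() is the part of native_cpu_up() that sends the
INIT/SIPIs, do_wait_cpu() is the wait side shown above):

/* Called from the new CPUHP_BRINGUP_KICK_CPU state */
int __cpu_init(unsigned int cpu, struct task_struct *tidle)
{
#ifdef EARLY_INIT
	return do_cpu_up(cpu, tidle);	/* kick the AP early */
#else
	return 0;			/* no-op; kick happens later */
#endif
}

/* Called from the existing CPUHP_BRINGUP_CPU state */
int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
{
#ifndef EARLY_INIT
	int ret = do_cpu_up(cpu, tidle); /* kick the AP here, as today */

	if (ret)
		return ret;
#endif
	return do_wait_cpu(cpu);	/* wait for it to come online */
}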

First we tested under qemu (on a Skylake EC2 c5.metal instance). The
do_boot_cpu() call that actually sends the IPIs took ~300k cycles itself.
Without EARLY_INIT we see timing for the four wait phases along the
lines of:

[ 0.285312] CPU#10 up in 192950, 952898, 60014786, 28 ( 61160662)
[ 0.288311] CPU#11 up in 181092, 962704, 60010432, 30 ( 61154258)
[ 0.291312] CPU#12 up in 386080, 970416, 60013956, 28 ( 61370480)
[ 0.294311] CPU#13 up in 372782, 964506, 60010564, 28 ( 61347880)
[ 0.297312] CPU#14 up in 389602, 976280, 60013046, 28 ( 61378956)
[ 0.300312] CPU#15 up in 213132, 968148, 60012138, 28 ( 61193446)

If we define EARLY_INIT then that first phase of waiting for the CPU to
add itself is pretty much instantaneous, which is precisely what we
were hoping for. We also seem to save about 300k cycles on the AP
bringup. It's just that it *all* pales into insignificance next to
whatever it's doing to synchronise the TSC for 60M cycles.

[ 0.338829] CPU#10 up in 600, 689054, 60025522, 28 ( 60715204)
[ 0.341829] CPU#11 up in 610, 635346, 60019390, 28 ( 60655374)
[ 0.343829] CPU#12 up in 632, 619352, 60020728, 28 ( 60640740)
[ 0.346829] CPU#13 up in 602, 514234, 60025402, 26 ( 60540264)
[ 0.348830] CPU#14 up in 608, 621058, 60025952, 26 ( 60647644)
[ 0.351829] CPU#15 up in 600, 624690, 60021526, 410 ( 60647226)

Testing on real hardware has been more interesting and less useful so
far. We started with the CPUHP_BRINGUP_KICK_CPU state being
*immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
that didn't come up at all even without actually *doing* anything in
the pre-bringup phase. Merely bringing all the AP threads up through
the various CPUHP_PREPARE_foo stages before actually bringing them
online was enough to break it. I have no serial port on this box so we
haven't yet worked out why; I've resorted to putting the
CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

That lets it boot without the EARLY_INIT at least (so the new state is
basically a no-op), and I get these timings. Looks like there's 3-4M
cycles to be had by doing the INIT/SIPI in parallel, but the *first*
thread of each core is also taking another 8M cycles and it might be
worth doing *those* in parallel too. And Thomas, I think that waiting
for the AP to bring itself up is the part you meant is pointlessly
reimplemented differently across architectures? So the way forward
there might be to offer a generic CPUHP state for that, for
architectures to plug into and ditch their own tracking.

[ 0.311581] CPU#1 up in 4057008, 8820492, 1828, 808 ( 12880136)
[ 0.316802] CPU#2 up in 3885080, 8738092, 1792, 904 ( 12625868)
[ 0.321674] CPU#3 up in 3468276, 8244880, 1724, 860 ( 11715740)
[ 0.326609] CPU#4 up in 3565216, 8357876, 1808, 984 ( 11925884)
[ 0.331565] CPU#5 up in 3566916, 8367340, 1836, 708 ( 11936800)
[ 0.336446] CPU#6 up in 3465324, 8249512, 1756, 796 ( 11717388)
[ 0.341337] CPU#7 up in 3518268, 8313476, 1572, 1072 ( 11834388)
[ 0.346196] CPU#8 up in 3479444, 8260244, 1648, 608 ( 11741944)
[ 0.351068] CPU#9 up in 3475692, 8269092, 1568, 908 ( 11747260)
[ 0.355968] CPU#10 up in 3534648, 8336648, 1488, 864 ( 11873648)
[ 0.361306] CPU#11 up in 4028288, 8932216, 1632, 692 ( 12962828)
[ 0.366657] CPU#12 up in 4046256, 8941736, 1624, 1164 ( 12990780)
[ 0.371985] CPU#13 up in 4012912, 8922192, 1700, 964 ( 12937768)
[ 0.373813] CPU#14 up in 3794196, 300948, 1520, 1300 ( 4097964)
[ 0.374936] CPU#15 up in 3853616, 265080, 1428, 784 ( 4120908)
[ 0.376843] CPU#16 up in 3841572, 261448, 1428, 528 ( 4104976)
[ 0.378597] CPU#17 up in 3420856, 258888, 1272, 872 ( 3681888)
[ 0.380403] CPU#18 up in 3516220, 259840, 2152, 648 ( 3778860)
[ 0.382210] CPU#19 up in 3503316, 262876, 1720, 500 ( 3768412)
[ 0.383975] CPU#20 up in 3421752, 263248, 1472, 764 ( 3687236)
[ 0.385747] CPU#21 up in 3434744, 272240, 1352, 716 ( 3709052)
[ 0.387516] CPU#22 up in 3427700, 273900, 1260, 820 ( 3703680)
[ 0.389300] CPU#23 up in 3457724, 269708, 1328, 816 ( 3729576)
[ 0.391089] CPU#24 up in 3466012, 269136, 1296, 824 ( 3737268)
[ 0.393067] CPU#25 up in 3970568, 279256, 1432, 892 ( 4252148)
[ 0.395042] CPU#26 up in 3977228, 283956, 1656, 772 ( 4263612)
[ 0.397020] CPU#27 up in 3946448, 288852, 1600, 648 ( 4237548)

When I enabled EARLY_INIT it didn't boot; I need to hook up some box
with a serial port to make more meaningful progress there, but figured
it was worth sharing the findings so far.

Here's the hack we're testing with, for reference. It's kind of ugly
but you can see where it's going. Note that the CMOS mangling for the
warm reset vector is going to need to be lifted out of the per-cpu
loop, done *once* at startup and torn down once in smp_cpus_done().
Except that it also needs to be done before/after a hotplug cpu up;
we'll have to come back to that, but for now we've just shifted it to
native_smp_cpus_done() for testing.
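
For the record, the end state we're describing would look roughly like
this (a hypothetical sketch; the hack below only moves the restore so
far, and hotplug would still need its own setup/teardown):

	/* Once, before kicking any AP (e.g. in native_smp_prepare_cpus()) */
	if (x86_platform.legacy.warm_reset)
		smpboot_setup_warm_reset_vector(real_mode_header->trampoline_start);

	/* ... kick all the APs and wait for them to come online ... */

	/* Once, when they're all up (as the hack does in native_smp_cpus_done()) */
	if (x86_platform.legacy.warm_reset)
		smpboot_restore_warm_reset_vector();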


diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f57ede3cb8e..99d1fa254921 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -308,7 +308,7 @@ static void kvm_register_steal_time(void)
return;

wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
- pr_info("stealtime: cpu %d, msr %llx\n", cpu,
+ if (0) pr_info("stealtime: cpu %d, msr %llx\n", cpu,
(unsigned long long) slow_virt_to_phys(st));
}

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..79a5c26c376e 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -108,7 +108,7 @@ static inline void kvm_sched_clock_init(bool stable)
kvm_sched_clock_offset = kvm_clock_read();
pv_ops.time.sched_clock = kvm_sched_clock_read;

- pr_info("kvm-clock: using sched offset of %llu cycles",
+ if (0) pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);

BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
@@ -184,7 +184,7 @@ static void kvm_register_clock(char *txt)

pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
+ if (0) pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}

static void kvm_save_sched_clock_state(void)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index de776b2e6046..42f479979b52 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -360,7 +360,7 @@ int topology_update_die_map(unsigned int die, unsigned int cpu)
goto found;

new = logical_die++;
- if (new != die) {
+ if (0 && new != die) {
pr_info("CPU %u Converting physical %u to logical die %u\n",
cpu, die, new);
}
@@ -1028,9 +1028,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
{
/* start_ip had better be page-aligned! */
unsigned long start_ip = real_mode_header->trampoline_start;
-
unsigned long boot_error = 0;
- unsigned long timeout;

idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
@@ -1083,55 +1081,71 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
cpu0_nmi_registered);

- if (!boot_error) {
- /*
- * Wait 10s total for first sign of life from AP
- */
- boot_error = -1;
- timeout = jiffies + 10*HZ;
- while (time_before(jiffies, timeout)) {
- if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
- /*
- * Tell AP to proceed with initialization
- */
- cpumask_set_cpu(cpu, cpu_callout_mask);
- boot_error = 0;
- break;
- }
- schedule();
- }
- }
+ return boot_error;
+}

- if (!boot_error) {
- /*
- * Wait till AP completes initial initialization
- */
- while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
+int do_wait_cpu(unsigned int cpu)
+{
+ unsigned long flags;
+ unsigned long timeout;
+ cycles_t t1 = get_cycles(), t2, t3, t4, t5;
+ /*
+ * Wait 10s total for first sign of life from AP
+ */
+ int err = -1;
+ timeout = jiffies + 10*HZ;
+ while (time_before(jiffies, timeout)) {
+ if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
/*
- * Allow other tasks to run while we wait for the
- * AP to come online. This also gives a chance
- * for the MTRR work(triggered by the AP coming online)
- * to be completed in the stop machine context.
+ * Tell AP to proceed with initialization
*/
- schedule();
+ cpumask_set_cpu(cpu, cpu_callout_mask);
+ err = 0;
+ break;
}
+ schedule();
}

- if (x86_platform.legacy.warm_reset) {
+ if (err)
+ return err;
+ t2 = get_cycles();
+ /*
+ * Wait till AP completes initial initialization
+ */
+ while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
/*
- * Cleanup possible dangling ends...
+ * Allow other tasks to run while we wait for the
+ * AP to come online. This also gives a chance
+ * for the MTRR work(triggered by the AP coming online)
+ * to be completed in the stop machine context.
*/
- smpboot_restore_warm_reset_vector();
+ schedule();
}

- return boot_error;
+ /*
+ * Check TSC synchronization with the AP (keep irqs disabled
+ * while doing so):
+ */
+ t3 = get_cycles();
+ local_irq_save(flags);
+ check_tsc_sync_source(cpu);
+ local_irq_restore(flags);
+ t4 = get_cycles();
+ while (!cpu_online(cpu)) {
+ cpu_relax();
+ touch_nmi_watchdog();
+ }
+ t5 = get_cycles();
+
+ printk("CPU#%d up in %10lld,%10lld,%10lld,%10lld (%10lld)\n", cpu,
+ t2-t1, t3-t2, t4-t3, t5-t4, t5-t1);
+ return 0;
}

-int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+int do_cpu_up(unsigned int cpu, struct task_struct *tidle)
{
int apicid = apic->cpu_present_to_apicid(cpu);
int cpu0_nmi_registered = 0;
- unsigned long flags;
int err, ret = 0;

lockdep_assert_irqs_enabled();
@@ -1178,19 +1192,6 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
goto unreg_nmi;
}

- /*
- * Check TSC synchronization with the AP (keep irqs disabled
- * while doing so):
- */
- local_irq_save(flags);
- check_tsc_sync_source(cpu);
- local_irq_restore(flags);
-
- while (!cpu_online(cpu)) {
- cpu_relax();
- touch_nmi_watchdog();
- }
-
unreg_nmi:
/*
* Clean up the nmi handler. Do this after the callin and callout sync
@@ -1202,6 +1203,31 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
return ret;
}

+#define EARLY_INIT
+
+int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+{
+ int ret;
+
+#ifndef EARLY_INIT
+ ret = do_cpu_up(cpu, tidle);
+ if (ret)
+ return ret;
+#endif
+ ret = do_wait_cpu(cpu);
+ return ret;
+}
+
+int __cpu_init(unsigned int cpu, struct task_struct *tidle)
+{
+ int ret = 0;
+
+#ifdef EARLY_INIT
+ ret = do_cpu_up(cpu, tidle);
+#endif
+ return ret;
+}
+
/**
* arch_disable_smp_support() - disables SMP support for x86 at runtime
*/
@@ -1415,6 +1441,13 @@ void __init native_smp_cpus_done(unsigned int max_cpus)
{
pr_debug("Boot done\n");

+ if (x86_platform.legacy.warm_reset) {
+ /*
+ * Cleanup possible dangling ends...
+ */
+ smpboot_restore_warm_reset_vector();
+ }
+
calculate_max_logical_packages();

if (x86_has_numa_in_package)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index bc56287a1ed1..cdea060b1009 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -61,6 +61,7 @@ enum cpuhp_state {
CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
CPUHP_PADATA_DEAD,
+ CPUHP_BRINGUP_KICK_CPU, /* Asynchronously kick/wake/INIT CPU */
CPUHP_WORKQUEUE_PREP,
CPUHP_POWER_NUMA_PREPARE,
CPUHP_HRTIMERS_PREPARE,
@@ -92,7 +93,7 @@ enum cpuhp_state {
CPUHP_MIPS_SOC_PREPARE,
CPUHP_BP_PREPARE_DYN,
CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20,
- CPUHP_BRINGUP_CPU,
+ CPUHP_BRINGUP_CPU, /* Wait for CPU to actually respond */
CPUHP_AP_IDLE_DEAD,
CPUHP_AP_OFFLINE,
CPUHP_AP_SCHED_STARTING,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2b8d7a5db383..17881f836de6 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -545,11 +545,36 @@ static int bringup_wait_for_ap(unsigned int cpu)
return cpuhp_kick_ap(st, st->target);
}

-static int bringup_cpu(unsigned int cpu)
+extern int __cpu_init(unsigned int cpu, struct task_struct *tidle);
+static int bringup_kick_cpu(unsigned int cpu)
{
struct task_struct *idle = idle_thread_get(cpu);
int ret;
+ cycles_t t = get_cycles();
+ /*
+ * Some architectures have to walk the irq descriptors to
+ * setup the vector space for the cpu which comes online.
+ * Prevent irq alloc/free across the bringup.
+ */
+ irq_lock_sparse();
+
+ /* Arch-specific enabling code. */
+ ret = __cpu_init(cpu, idle);
+ irq_unlock_sparse();
+
+ t = get_cycles() - t;
+ printk("bringup_kick_cpu %d in %ld cycles\n", cpu, t);
+ if (ret)
+ return ret;
+ return 0;
+}

+static int bringup_cpu(unsigned int cpu)
+{
+ struct task_struct *idle = idle_thread_get(cpu);
+ int ret;
+ cycles_t t2, t = get_cycles();
+
/*
* Some architectures have to walk the irq descriptors to
* setup the vector space for the cpu which comes online.
@@ -562,7 +587,12 @@ static int bringup_cpu(unsigned int cpu)
irq_unlock_sparse();
if (ret)
return ret;
- return bringup_wait_for_ap(cpu);
+ t2 = get_cycles() - t;
+ ret = bringup_wait_for_ap(cpu);
+ t = get_cycles() - t;
+ printk("bringup_cpu %d in %ld,%ld cycles\n", cpu, t2, t -t2);
+
+ return ret;
}

static int finish_cpu(unsigned int cpu)
@@ -1336,6 +1366,13 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus)
{
unsigned int cpu;

+ for_each_present_cpu(cpu) {
+ if (num_online_cpus() >= setup_max_cpus)
+ break;
+ if (!cpu_online(cpu))
+ cpu_up(cpu, CPUHP_BRINGUP_KICK_CPU);
+ }
+
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
break;
@@ -1565,7 +1602,13 @@ static struct cpuhp_step cpuhp_hp_states[] = {
.startup.single = timers_prepare_cpu,
.teardown.single = timers_dead_cpu,
},
- /* Kicks the plugged cpu into life */
+ /* Asynchronously kicks the plugged cpu into life */
+ [CPUHP_BRINGUP_KICK_CPU] = {
+ .name = "cpu:kick",
+ .startup.single = bringup_kick_cpu,
+ .cant_stop = true,
+ },
+ /* Wait for woken CPU to be responding */
[CPUHP_BRINGUP_CPU] = {
.name = "cpu:bringup",
.startup.single = bringup_cpu,
diff --git a/kernel/smp.c b/kernel/smp.c
index 4d17501433be..2d07d1c42789 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -807,6 +807,8 @@ void __init smp_init(void)

pr_info("Bringing up secondary CPUs ...\n");

+ // smp_cpus_start(setup_max_cpus);
+
bringup_nonboot_cpus(setup_max_cpus);

num_nodes = num_online_nodes();
