Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus

From: Rafael J. Wysocki
Date: Wed Jul 06 2016 - 20:28:49 EST


On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> Stress testing by Varun Koyyalagunta reports that a nonboot
> CPU would occasionally hang when resuming from hibernation.
> Further investigation shows that the hang happens when the
> nonboot CPU is woken up incorrectly and tries to monitor
> mwait_ptr a second time; an exception is then triggered by
> an illegal vaddr access, with a message like:
> 'Unable to handle kernel address of 0xffff8800ba800010...'
>
> This exception is caused by accessing a page without the
> PRESENT flag, because the pte entry for this vaddr is zero.
> Here's how the problem happens: the page table for the
> direct mapping is allocated dynamically by
> kernel_physical_mapping_init, so it is possible that, during
> resume, while the boot CPU is writing pages back to their
> original addresses, it happens to write to the monitored
> mwait_ptr and thereby wakes up one of the nonboot CPUs.
> Since the page table currently used by that nonboot CPU
> might not be the same as it was before hibernation, an
> exception may occur due to the inconsistent page table.
>
> The first attempt to fix this was to change the monitored
> address from task.flag to the zero page, because no one
> writes data to the zero page. But a problem remains, due to
> a ping-pong wake-up scenario in mwait_play_dead:
>
> One possible implementation of clflush is a read-invalidate
> snoop, which is what a store might look like, so clflush
> might break the mwait.
>
> 1. CPU1 waits at the zero page.
> 2. CPU2 clflushes the zero page, waking up CPU1, then CPU2 waits at the zero page.
> 3. CPU1 wakes up and clflushes the zero page, which wakes up CPU2 again.
> Thus the nonboot CPUs never sleep for long.
>
> So it's better to monitor a different address for each
> nonboot CPU. However, since there is only one zero page, at
> most PAGE_SIZE/L1_CACHE_LINE CPUs can be accommodated, which
> is usually 64 on x86_64. That is apparently not enough for
> servers, so more zero pages would be required.
>
> So choose a new solution, as Brian suggested: put the
> nonboot CPUs into hlt before resume, without touching any
> memory during suspend/resume. Theoretically there might
> still be problems if some of the CPUs have already been put
> offline, but since that case is very rare and users can work
> around it, we do not deal with it in the kernel for now.
>
> BTW, as James mentioned, he might want to make
> disable_nonboot_cpus arch-specific, so this patch might need
> a small change after that.
>
> Comments and suggestions would be appreciated.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-and-tested-by: Varun Koyyalagunta <cpudebug@xxxxxxxxxxxx>
> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>

Below is my sort of version of this (untested). I did it it this way because
the issue is specific to resume from hibernation (the workaround need not be
applied anywhere else), and the hibernate_resume_nonboot_cpu_disable() thing
may be useful to arm64 too, if I'm not mistaken (James?).

Actually, if arm64 uses it too, the __weak implementation can be dropped,
because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
(x86 and arm64 are the only users of that).

Thanks,
Rafael


---
arch/x86/include/asm/cpu.h | 2 ++
arch/x86/kernel/smpboot.c | 5 +++++
arch/x86/power/cpu.c | 19 +++++++++++++++++++
kernel/power/hibernate.c | 7 ++++++-
kernel/power/power.h | 2 ++
5 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@ int hibernation_snapshot(int platform_mo
 		goto Close;
 	}
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@ static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;

Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@ static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -266,6 +266,25 @@ void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during image restoration, because it is likely that the monitored
+	 * address will be actually written to at that time and then the "dead"
+	 * CPU may start executing instructions from an image kernel's page
+	 * (and that may not be the "play dead" loop any more).
+	 */
+	force_hlt_play_dead = true;
+	ret = disable_nonboot_cpus();
+	force_hlt_play_dead = false;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1441,6 +1441,8 @@ __init void prefill_possible_map(void)

 #ifdef CONFIG_HOTPLUG_CPU
 
+bool force_hlt_play_dead;
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1642,6 +1644,9 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
+
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
Index: linux-pm/arch/x86/include/asm/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/cpu.h
+++ linux-pm/arch/x86/include/asm/cpu.h
@@ -26,6 +26,8 @@ struct x86_cpu {
 };
 
 #ifdef CONFIG_HOTPLUG_CPU
+extern bool force_hlt_play_dead;
+
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void start_cpu0(void);