Re: [PATCH] x86: avoid calling arch_trigger_all_cpu_backtrace() atthe same time on SMP

From: DDD
Date: Wed Nov 10 2010 - 03:36:18 EST


Ingo Molnar wrote:
* Don Zickus <dzickus@xxxxxxxxxx> wrote:

From: Dongdong Deng <dongdong.deng@xxxxxxxxxxxxx>

The spin_lock_debug/rcu_cpu_stall detector uses
trigger_all_cpu_backtrace() to dump cpu backtrace.
Therefore it is possible that trigger_all_cpu_backtrace()
could be called at the same time on different CPUs, which
triggers and 'unknown reason NMI' warning. The following case
illustrates the problem:

CPU1 CPU2 ... CPU N
trigger_all_cpu_backtrace()
set "backtrace_mask" to cpu mask
|
generate NMI interrupts generate NMI interrupts ...
\ | /
\ | /
The "backtrace_mask" will be cleaned by the first NMI interrupt
at nmi_watchdog_tick(), then the following NMI interrupts generated
by other cpus's arch_trigger_all_cpu_backtrace() will be took as
unknown reason NMI interrupts.

This patch uses a lock to avoid the problem, and stop the
arch_trigger_all_cpu_backtrace() calling to avoid dumping double cpu
backtrace info when there is already a trigger_all_cpu_backtrace()
in progress.

Signed-off-by: Dongdong Deng <dongdong.deng@xxxxxxxxxxxxx>
Reviewed-by: Bruce Ashfield <bruce.ashfield@xxxxxxxxxxxxx>
CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
CC: Ingo Molnar <mingo@xxxxxxxxxx>
CC: "H. Peter Anvin" <hpa@xxxxxxxxx>
CC: x86@xxxxxxxxxx
CC: linux-kernel@xxxxxxxxxxxxxxx
Signed-off-by: Don Zickus <dzickus@xxxxxxxxxx>
---
arch/x86/kernel/apic/hw_nmi.c | 14 ++++++++++++++
arch/x86/kernel/apic/nmi.c | 14 ++++++++++++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index cefd694..3aea0a5 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -29,6 +29,16 @@ u64 hw_nmi_get_sample_period(void)
void arch_trigger_all_cpu_backtrace(void)
{
int i;
+ static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;

Please dont put statics into the middle of local variables - put them into file scope in a visible way.

Got it. will change it.


+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (!arch_spin_trylock(&lock))
+ /*
+ * If there is already a trigger_all_cpu_backtrace()
+ * in progress, don't output double cpu dump infos.
+ */
+ goto out_restore_irq;
cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
@@ -41,6 +51,10 @@ void arch_trigger_all_cpu_backtrace(void)
break;
mdelay(1);
}
+
+ arch_spin_unlock(&lock);
+out_restore_irq:
+ local_irq_restore(flags);
}
static int __kprobes
diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index a43f71c..5fa8a13 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -552,6 +552,16 @@ int do_nmi_callback(struct pt_regs *regs, int cpu)
void arch_trigger_all_cpu_backtrace(void)
{
int i;
+ static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (!arch_spin_trylock(&lock))
+ /*
+ * If there is already a trigger_all_cpu_backtrace()
+ * in progress, don't output double cpu dump infos.
+ */
+ goto out_restore_irq;
cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
@@ -564,4 +574,8 @@ void arch_trigger_all_cpu_backtrace(void)
break;
mdelay(1);
}
+
+ arch_spin_unlock(&lock);
+out_restore_irq:
+ local_irq_restore(flags);

This spinlock is never actually used as a spinlock - it's a "in progress" flag. Why not use a flag and test_and_set_bit()?

Yep, it's a "in progress" flag, I will change to use a flag to replace the spinlock.


Also, the irq disabling really needed, will this code ever be called with irqs on?

This code could be called with irqs on.

If the arch_trigger_all_cpu_backtrace() was triggered by
"spin_lock()"'s spin_lock_debug detector, it is possible that
the irq is enabled, thus we have to save and disable it here.


Thanks,

Dongdong


Thanks,

Ingo


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/