Re: [debug PATCHes] Re: smp_call_function_single lockups

From: Ingo Molnar
Date: Wed Apr 01 2015 - 08:39:24 EST



* Chris J Arges <chris.j.arges@xxxxxxxxxxxxx> wrote:

> This was tested only on the L1, so I can put this on the L0 host and run
> this as well. The results:
>
> [ 124.897002] apic: vector c1, new-domain move in progress
> [ 124.954827] apic: vector d1, sent cleanup vector, move completed
> [ 163.477270] apic: vector d1, new-domain move in progress
> [ 164.041938] apic: vector e1, sent cleanup vector, move completed
> [ 213.466971] apic: vector e1, new-domain move in progress
> [ 213.775639] apic: vector 22, sent cleanup vector, move completed
> [ 365.996747] apic: vector 22, new-domain move in progress
> [ 366.011136] apic: vector 42, sent cleanup vector, move completed
> [ 393.836032] apic: vector 42, new-domain move in progress
> [ 393.837727] apic: vector 52, sent cleanup vector, move completed
> [ 454.977514] apic: vector 52, new-domain move in progress
> [ 454.978880] apic: vector 62, sent cleanup vector, move completed
> [ 467.055730] apic: vector 62, new-domain move in progress
> [ 467.058129] apic: vector 72, sent cleanup vector, move completed
> [ 545.280125] apic: vector 72, new-domain move in progress
> [ 545.282801] apic: vector 82, sent cleanup vector, move completed
> [ 567.631652] apic: vector 82, new-domain move in progress
> [ 567.632207] apic: vector 92, sent cleanup vector, move completed
> [ 628.940638] apic: vector 92, new-domain move in progress
> [ 628.965274] apic: vector a2, sent cleanup vector, move completed
> [ 635.187433] apic: vector a2, new-domain move in progress
> [ 635.191643] apic: vector b2, sent cleanup vector, move completed
> [ 673.548020] apic: vector b2, new-domain move in progress
> [ 673.553843] apic: vector c2, sent cleanup vector, move completed
> [ 688.221906] apic: vector c2, new-domain move in progress
> [ 688.229487] apic: vector d2, sent cleanup vector, move completed
> [ 723.818916] apic: vector d2, new-domain move in progress
> [ 723.828970] apic: vector e2, sent cleanup vector, move completed
> [ 733.485435] apic: vector e2, new-domain move in progress
> [ 733.615007] apic: vector 23, sent cleanup vector, move completed
> [ 824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Are these all the messages? Looks like Linus's warnings went away, or
did you filter them out?

But ... the affinity setting message does not appear to trigger, and
that's the only real race I can see in the code. Also, the frequency
of these messages appears to be low, while the race window is narrow.
So I'm not sure the problem is related to the irq-move mechanism.
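
To put rough numbers on "low frequency", the quoted log can be
summarized by pairing each "move in progress" line with the following
"move completed" line. The sketch below is a minimal userspace helper,
not kernel code: the names parse_apic_line(), stats_feed() and the two
structs are hypothetical, and it assumes the exact message format shown
in the log above. Note that it measures how long move_in_progress stays
set, which is not necessarily the same thing as the race window in the
code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: one parsed "apic: vector ..." debug line. */
struct apic_event {
	double ts;           /* dmesg timestamp, in seconds */
	unsigned int vector; /* vector number (printed in hex) */
	int is_start;        /* 1: "move in progress", 0: "move completed" */
};

/* Running summary of start -> completed intervals. Zero-initialize it. */
struct move_stats {
	int moves;           /* number of completed start/complete pairs */
	int pending;         /* 1 if a start was seen without a completion */
	double start_ts;     /* timestamp of the pending start */
	double min_s, max_s, total_s;
};

/* Returns 1 if the line is one of the two debug messages, 0 otherwise. */
static int parse_apic_line(const char *line, struct apic_event *ev)
{
	const char *p;

	assert(line != NULL && ev != NULL);

	/* Skip any quoting prefix such as "> " before the timestamp. */
	p = strchr(line, '[');
	if (!p || sscanf(p, "[ %lf] apic: vector %x,", &ev->ts, &ev->vector) != 2)
		return 0;

	if (strstr(p, "move in progress"))
		ev->is_start = 1;
	else if (strstr(p, "move completed"))
		ev->is_start = 0;
	else
		return 0;
	return 1;
}

/* Feed one log line; unrelated lines (e.g. the watchdog splat) are ignored. */
static void stats_feed(struct move_stats *st, const char *line)
{
	struct apic_event ev;
	double d;

	if (!parse_apic_line(line, &ev))
		return;

	if (ev.is_start) {
		st->pending = 1;
		st->start_ts = ev.ts;
		return;
	}
	if (!st->pending)
		return; /* completion without a logged start: skip it */

	d = ev.ts - st->start_ts;
	if (st->moves == 0 || d < st->min_s)
		st->min_s = d;
	if (st->moves == 0 || d > st->max_s)
		st->max_s = d;
	st->total_s += d;
	st->moves++;
	st->pending = 0;
}
```

To use it, zero-initialize a struct move_stats, call stats_feed() on
each line of the dmesg output (for example from an fgets() loop), and
then read moves, min_s, max_s and total_s.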

One thing that appears to be weird: why is there irq-movement activity
to begin with? Is something changing irq-affinities?

Could you put a dump_stack() into the call? Something like the patch
below, in addition to all patches so far. (If it conflicts with the
previous debugging patches, just add the code manually after the
debug printout.)

Thanks,

Ingo

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..79d6de6fdf0a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
+			if (cfg->move_in_progress)
+				dump_stack();
 			break;
 		}
