Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
From: Thomas Gleixner
Date: Sun Jul 27 2025 - 16:01:14 EST
On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>
> The CPU halted during Kdump likely due to IPI issues when one CPU was
> rebooting and another was in Kdump:
>
> CPU0                          CPU2
> ========================      ======================
> reboot                        Panic
> machine shutdown              Kdump
>                               machine shutdown
> stop other cpus
>                               stop other cpus
> ...                           ...
> local_irq_disable             local_irq_disable
> send_IPIs(REBOOT)             [critical regions]
> [critical regions]            1) send_IPIs(REBOOT)
After staring more at it, this makes absolutely no sense at all.

stop_other_cpus() does:

	/* Only proceed if this is the first CPU to reach this code */
	old_cpu = -1;
	this_cpu = smp_processor_id();
	if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
		return;

So CPU2 _cannot_ reach the code which issues the reboot IPIs, because
at that point @stopping_cpu == 0 (CPU0 already claimed it) and not -1,
ergo the cmpxchg() fails and CPU2 returns early.
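
The first-caller-wins behaviour can be reproduced in user space. The
following is only a rough sketch, assuming C11 <stdatomic.h> as a stand-in
for the kernel's atomic_try_cmpxchg(); try_claim_stop() and
replay_reboot_vs_kdump() are hypothetical names made up for illustration
and are not kernel functions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's stopping_cpu; -1 means nobody has claimed it. */
static atomic_int stopping_cpu = -1;

/*
 * Hypothetical helper mirroring the check quoted above: returns true if
 * @this_cpu is the first caller and may go on to send the REBOOT IPIs,
 * false if another CPU already claimed the slot.
 */
static bool try_claim_stop(int this_cpu)
{
	int old_cpu = -1;

	/* Succeeds only while stopping_cpu still holds -1 */
	return atomic_compare_exchange_strong(&stopping_cpu, &old_cpu, this_cpu);
}

/* Replays the ordering discussed here: CPU0 arrives first, CPU2 second. */
static void replay_reboot_vs_kdump(void)
{
	assert(try_claim_stop(0));	/* CPU0: cmpxchg succeeds */
	assert(!try_claim_stop(2));	/* CPU2: cmpxchg fails, would return */
}
```

After replay_reboot_vs_kdump() the variable still holds 0, i.e. the second
caller never gets past the check, whichever path brought it there.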
So what actually happens in this case is:

CPU0                        CPU2
========================    ======================
reboot                      Panic
machine shutdown            Kdump
                            machine_crash_shutdown()
stop other cpus             local_irq_disable()
try_cmpxchg() succeeds      stop other cpus
...                         try_cmpxchg() fails
send_IPIs(REBOOT)   -->     REBOOT vector becomes pending in IRR
wait timeout

And from there on everything becomes a lottery as CPU0 continues to
execute and CPU2 proceeds and jumps into the crash kernel...
This whole logic is broken...
Nevertheless the patch I sent earlier is definitely making things more
robust, but it won't solve your particular problem.
Thanks,
tglx