Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

From: Borislav Petkov
Date: Tue Jan 10 2023 - 07:19:24 EST


On Tue, Jan 10, 2023 at 01:11:29PM +0100, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote:
> > From: Zeng Heng <zengheng4@xxxxxxxxxx>
> > Date: Tue, 10 Jan 2023 18:27:45 +0800
> > Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
> >
> > If kdump is enabled, when using mce_inject to inject errors, EFI
>
> Why does "EFI" matter here? Any boot loader would do...
>
> > boot loader would decompress & load second kernel for saving the
>
> s/&/and/
>
> > vmcore file.
> >
> > For normal errors that is fine.
>
> Useless sentence.
>
> > However, in the MCE case, the panic
> > CPU that firstly enters into mce_panic() is running within NMI
> > interrupt context,
>
> "#MC context": it is non-maskable, but that's not "NMI interrupt context".
>
> > and the processor blocks delivery of subsequent
> > NMIs until the next execution of the IRET instruction.
> >
> > When the panic CPU takes long time in the panic processing route,
>
> I'm still unclear on the order of events here. It sounds like
>
> 1. MCE injected
> 2. panic
> 3. kdump gets loaded
>
> If that is the case, then I presume the flow is:
>
> mce_panic -> panic -> __crash_kexec()
>
> Yes?
>
> If so, then we should make sure we have *exited* #MC context before calling
> panic() and not have to add hacks like this one of adding an empty NMI handler.
>
> But I'm only speculating as it is hard to make sense of all this text.

IOW, does this help?

---
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7832a69d170e..55437d8a4fad 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
 		panic(msg);
+		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);



--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette