Re: [PATCH -v2 4/7] x86, NMI, Rewrite NMI handler

From: huang ying
Date: Mon Sep 27 2010 - 08:39:30 EST


Hi, Robert,

On Mon, Sep 27, 2010 at 5:41 PM, Robert Richter <robert.richter@xxxxxxx> wrote:
> On 26.09.10 20:57:03, Huang Ying wrote:
>> The original NMI handler is quite outdated in many aspects. This patch
>> tries to fix it.
>>
>> The order in which the NMI sources are processed is changed as follows:
>>
>> notify_die(DIE_NMI_IPI);
>> notify_die(DIE_NMI);
>> /* process io port 0x61 */
>> nmi_watchdog_touch();
>> notify_die(DIE_NMIUNKNOWN);
>> unknown_nmi();
>>
>> DIE_NMI_IPI is used to process CPU-specific NMI sources, such as perf
>> event, oprofile, crash IPI, etc., while DIE_NMI is used to process
>> non-CPU-specific NMI sources, such as APEI (ACPI Platform Error
>> Interface) GHES (Generic Hardware Error Source), etc. Non-CPU-specific
>> NMI sources can be processed on any CPU.
>>
>> DIE_NMI_IPI must be processed before DIE_NMI. For example, a perf event
>> triggers an NMI on CPU 1 and, at the same time, APEI GHES triggers
>> another NMI on CPU 0. If DIE_NMI is processed before DIE_NMI_IPI, it is
>> possible that the APEI GHES NMI is processed on CPU 1, while CPU 0 gets
>> an unknown NMI.
>
> I think macro names DIE_NMI_IPI and DIE_NMI should be swapped as
> e.g. the perf nmi is actually local and non-IPI.

DIE_NMI_IPI may not be a good name for perf, but DIE_NMI is an even
worse name for perf! DIE_NMI was originally used for the IOCHK and PCI
SERR NMIs.

> We might consider reworking the IPI thing completely, but maybe in a
> follow-on patch.
>
>>
>> In this new order of processing, performance-sensitive NMI sources
>> such as oprofile or perf event will have better performance because
>> the time-consuming IO port read is done after them.
>>
>> Only one NMI is eaten for each NMI handler call, even for PCI SERR and
>> IOCHK NMIs. Because one NMI should be raised for each of them, eating
>> too many NMIs will cause unnecessary unknown NMIs.
>>
>> The die values used in the NMI sources are fixed accordingly.
>>
>> The NMI handler in the patch is designed by Andi Kleen.
>>
>>
>> v2:
>>
>> - Split the processing of the NMI reason (0x61) on non-BSP CPUs into
>>   another patch
>>
>> Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
>> ---
>>  arch/x86/kernel/cpu/perf_event.c  |    1
>>  arch/x86/kernel/traps.c           |   80 +++++++++++++++++++-------------------
>>  arch/x86/oprofile/nmi_int.c       |    1
>>  arch/x86/oprofile/nmi_timer_int.c |    2
>>  drivers/char/ipmi/ipmi_watchdog.c |    2
>>  drivers/watchdog/hpwdt.c          |    2
>>  6 files changed, 43 insertions(+), 45 deletions(-)
>>
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>> @@ -1247,7 +1247,6 @@ perf_event_nmi_handler(struct notifier_b
>>                  return NOTIFY_DONE;
>>
>>          switch (cmd) {
>> -       case DIE_NMI:
>>          case DIE_NMI_IPI:
>
> See my comment above. Same is true for oprofile and some other
> handlers below. It isn't an IPI and should be case DIE_NMI: instead.
>
>>                  break;
>>          case DIE_NMIUNKNOWN:
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -354,9 +354,6 @@ io_check_error(unsigned char reason, str
>>  static notrace __kprobes void
>>  unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>>  {
>> -       if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
>> -                       NOTIFY_STOP)
>> -               return;
>>  #ifdef CONFIG_MCA
>>          /*
>>           * Might actually be able to figure out what the guilty party
>> @@ -385,51 +382,54 @@ static notrace __kprobes void default_do
>>
>>          cpu = smp_processor_id();
>
> This should go to if (!cpu) and maybe we drop variable cpu completely.

The variable cpu is dropped in 5/7.

>>
>> -       /* Only the BSP gets external NMIs from the system. */
>> -       if (!cpu)
>> -               reason = get_nmi_reason();
>> +       /*
>> +        * CPU-specific NMI must be processed before non-CPU-specific
>> +        * NMI, otherwise we may lose it, because the CPU-specific
>> +        * NMI can not be detected/processed on other CPUs.
>> +        */
>>
>> -       if (!(reason & NMI_REASON_MASK)) {
>> -               if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
>> -                                                               == NOTIFY_STOP)
>> -                       return;
>> +       /*
>> +        * CPU-specific NMI: send to specific CPU or NMI sources must
>> +        * be processed on specific CPU
>> +        */
>> +       if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, 0, 2, SIGINT)
>> +           == NOTIFY_STOP)
>> +               return;
>>
>> -#ifdef CONFIG_X86_LOCAL_APIC
>
> Are you sure we may drop this option?

Yes. DIE_NMI is used for non-CPU-specific NMI sources now.

>> -               if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
>> -                                                       == NOTIFY_STOP)
>> -                       return;
>> +       /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
>> +       if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
>> +               return;
>
> As said, IPI and non-IPI are mixed up.

They are processed one after the other.
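
To make the ordering argument from the changelog concrete, here is a
minimal user-space sketch, not kernel code. The names perf_pending,
ghes_pending, handle_nmi() and count_unknown_nmis() are made up for
illustration, and the NOTIFY_* values are simplified stand-ins. It
models perf raising an NMI on CPU 1 while APEI GHES raises one on CPU 0,
with CPU 1 happening to enter the handler first:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

/* Simplified stand-ins for the notifier return values (assumption). */
enum { NOTIFY_DONE = 0, NOTIFY_STOP = 1 };

static bool perf_pending[NR_CPUS];	/* CPU-specific source */
static bool ghes_pending;		/* non-CPU-specific source */

/* Models the DIE_NMI_IPI chain: only sees sources local to this CPU. */
static int cpu_specific_chain(int cpu)
{
	if (!perf_pending[cpu])
		return NOTIFY_DONE;
	perf_pending[cpu] = false;
	return NOTIFY_STOP;
}

/* Models the DIE_NMI chain: any CPU may claim a pending source. */
static int non_cpu_specific_chain(void)
{
	if (!ghes_pending)
		return NOTIFY_DONE;
	ghes_pending = false;
	return NOTIFY_STOP;
}

/* One NMI handler call eats at most one NMI; true if it was claimed. */
static bool handle_nmi(int cpu, bool cpu_specific_first)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	if (cpu_specific_first) {
		if (cpu_specific_chain(cpu) == NOTIFY_STOP)
			return true;
		return non_cpu_specific_chain() == NOTIFY_STOP;
	}
	if (non_cpu_specific_chain() == NOTIFY_STOP)
		return true;
	return cpu_specific_chain(cpu) == NOTIFY_STOP;
}

/* perf NMI on CPU 1, GHES NMI on CPU 0, CPU 1 runs its handler first. */
static int count_unknown_nmis(bool cpu_specific_first)
{
	int unknown = 0;

	perf_pending[0] = false;
	perf_pending[1] = true;
	ghes_pending = true;

	if (!handle_nmi(1, cpu_specific_first))
		unknown++;
	if (!handle_nmi(0, cpu_specific_first))
		unknown++;
	return unknown;
}
```

With the CPU-specific chain first, both NMIs are claimed. With the
order reversed, CPU 1 eats the GHES NMI, the perf source stays pending,
and CPU 0 is left with an unknown NMI. The bad case depends on which
CPU enters the handler first, which is why it is only a possibility.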

>>
>> -#ifndef CONFIG_LOCKUP_DETECTOR
>> -               /*
>> -                * Ok, so this is none of the documented NMI sources,
>> -                * so it must be the NMI watchdog.
>> -                */
>> -               if (nmi_watchdog_tick(regs, reason))
>> -                       return;
>> -               if (!do_nmi_callback(regs, cpu))
>> -#endif /* !CONFIG_LOCKUP_DETECTOR */
>> -                       unknown_nmi_error(reason, regs);
>> -#else
>> -               unknown_nmi_error(reason, regs);
>> +       if (!cpu) {
>> +               reason = get_nmi_reason();
>> +               if (reason & NMI_REASON_MASK) {
>> +                       if (reason & NMI_REASON_SERR)
>> +                               pci_serr_error(reason, regs);
>> +                       else if (reason & NMI_REASON_IOCHK)
>> +                               io_check_error(reason, regs);
>> +#ifdef CONFIG_X86_32
>> +                       /*
>> +                        * Reassert NMI in case it became active
>> +                        * meanwhile as it's edge-triggered:
>> +                        */
>> +                       reassert_nmi();
>>  #endif
>> +                       return;
>> +               }
>> +       }
>>
>> +#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_LOCKUP_DETECTOR)
>> +       if (nmi_watchdog_tick(regs, reason))
>>                  return;
>> -       }
>> -       if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP)
>> +       if (do_nmi_callback(regs, smp_processor_id()))
>>                  return;
>> -
>> -       /* AK: following checks seem to be broken on modern chipsets. FIXME */
>> -       if (reason & NMI_REASON_SERR)
>> -               pci_serr_error(reason, regs);
>> -       if (reason & NMI_REASON_IOCHK)
>> -               io_check_error(reason, regs);
>> -#ifdef CONFIG_X86_32
>> -       /*
>> -        * Reassert NMI in case it became active meanwhile
>> -        * as it's edge-triggered:
>> -        */
>> -       reassert_nmi();
>>  #endif
>> +
>> +       if (notify_die(DIE_NMIUNKNOWN, "nmi_unknown", regs, reason, 2, SIGINT)
>> +           == NOTIFY_STOP)
>> +               return;
>> +
>> +       unknown_nmi_error(reason, regs);
>>  }
>>
>>  dotraplinkage notrace __kprobes void
>> --- a/arch/x86/oprofile/nmi_int.c
>> +++ b/arch/x86/oprofile/nmi_int.c
>> @@ -64,7 +64,6 @@ static int profile_exceptions_notify(str
>>          int ret = NOTIFY_DONE;
>>
>>          switch (val) {
>> -       case DIE_NMI:
>>          case DIE_NMI_IPI:
>>                  if (ctr_running)
>>                          model->check_ctrs(args->regs, &__get_cpu_var(cpu_msrs));
>> --- a/arch/x86/oprofile/nmi_timer_int.c
>> +++ b/arch/x86/oprofile/nmi_timer_int.c
>> @@ -25,7 +25,7 @@ static int profile_timer_exceptions_noti
>>          int ret = NOTIFY_DONE;
>>
>>          switch (val) {
>> -       case DIE_NMI:
>> +       case DIE_NMI_IPI:
>>                  oprofile_add_sample(args->regs, 0);
>>                  ret = NOTIFY_STOP;
>>                  break;
>> --- a/drivers/char/ipmi/ipmi_watchdog.c
>> +++ b/drivers/char/ipmi/ipmi_watchdog.c
>> @@ -1080,7 +1080,7 @@ ipmi_nmi(struct notifier_block *self, un
>>  {
>>          struct die_args *args = data;
>>
>> -       if (val != DIE_NMI)
>> +       if (val != DIE_NMIUNKNOWN)

All watchdogs use DIE_NMIUNKNOWN in this patch, because they should be
processed after the CPU-specific and non-CPU-specific NMIs. Or should we
define a special DIE_NMI_XX for them, like DIE_NMI_WATCHDOG?
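
As a rough illustration of that alternative, here is a user-space sketch
of where a dedicated watchdog stage could sit in the processing order.
It is only an assumption about how such a value might be used: the
STAGE_* names, SKETCH_NOTIFY_* values, watchdog_notify() and
dispatch_nmi() are all hypothetical and are not part of this patch or
of the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stages, listed in processing order; NOT kernel values. */
enum nmi_stage {
	STAGE_NMI_IPI,		/* CPU-specific sources */
	STAGE_NMI,		/* non-CPU-specific sources */
	STAGE_NMI_WATCHDOG,	/* the DIE_NMI_WATCHDOG idea (hypothetical) */
	STAGE_NMIUNKNOWN,	/* last chance before the unknown NMI path */
	STAGE_COUNT
};

/* Simplified stand-ins for the notifier return values (assumption). */
enum { SKETCH_NOTIFY_DONE = 0, SKETCH_NOTIFY_STOP = 1 };

static bool watchdog_expired;

/* A watchdog callback that only reacts in its own stage. */
static int watchdog_notify(enum nmi_stage stage)
{
	assert(stage < STAGE_COUNT);

	if (stage != STAGE_NMI_WATCHDOG)
		return SKETCH_NOTIFY_DONE;
	if (!watchdog_expired)
		return SKETCH_NOTIFY_DONE;

	watchdog_expired = false;	/* eat exactly one NMI */
	return SKETCH_NOTIFY_STOP;
}

/*
 * Walk the stages in order. Returns the stage that claimed the NMI,
 * or STAGE_COUNT if nobody did (the unknown NMI case).
 */
static enum nmi_stage dispatch_nmi(void)
{
	for (int s = STAGE_NMI_IPI; s < STAGE_COUNT; s++) {
		if (watchdog_notify((enum nmi_stage)s) == SKETCH_NOTIFY_STOP)
			return (enum nmi_stage)s;
	}
	return STAGE_COUNT;
}
```

With a separate stage, the watchdogs would still run after the
CPU-specific and non-CPU-specific sources, and DIE_NMIUNKNOWN would stay
reserved for handlers that really deal with unknown NMIs.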

Best Regards,
Huang Ying