Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: Huang Ying
Date: Tue May 17 2011 - 03:41:17 EST


On 05/16/2011 07:29 PM, Ingo Molnar wrote:
>
> * Don Zickus <dzickus@xxxxxxxxxx> wrote:
>
>> On Fri, May 13, 2011 at 05:20:33PM +0200, Ingo Molnar wrote:
>>>
>>> * huang ying <huang.ying.caritas@xxxxxxxxx> wrote:
>>>
>>>>> What should be done instead is to add an event for unknown NMIs, which can
>>>>> then be processed by the RAS daemon to implement policy.
>>>>>
>>>>> By using 'active' event filters it could even be set on a system to panic
>>>>> the box by default.
>>>>
>>>> If there is a real fatal hardware error, we may not have the luxury of going
>>>> from the NMI handler to a user space RAS daemon to determine what to do. The
>>>> system may explode, or bad data may go to disk before that.
>>>
>>> That is why i suggested:
>>>
>>> > > By using 'active' event filters it could even be set on a system to panic
>>> > > the box by default.
>>>
>>> event filters are evaluated in the kernel, so the panic could be instantaneous,
>>> without the event having to reach user-space.
>>
>> Interesting. Question though, what do you mean by 'event filtering'? Is
>> that different than setting 'unknown_nmi_panic' on the command line or
>> via procfs?
>>
>> Or are you suggesting something like registering another callback on the
>> die_chain that looks for DIE_NMIUNKNOWN as the event, swallows it and
>> implements the policy? That way only HEST-related platforms would
>> register it, while others would keep the default 'Dazed and confused'
>> messages?
>
> The idea is that "event filters", an existing upstream feature that can be
> used in rather flexible ways:
>
> http://lkml.org/lkml/2011/4/27/660
>
> could also be used to trigger non-standard policy actions - such as panicking
> the box.
>
> This would replace various very limited /debugfs and /sys event filtering hacks
> (and hardcoded policies) such as arch/x86/kernel/cpu/mcheck/mce-severity.c, and
> it would allow nonstandard behavior like 'panic the box on unknown NMIs' as
> well.
>
> This could be set by the RAS daemon, and it could be propagated to the kernel
> boot line as well, where event filter syntax would look like this:
>
> events=nmi::unknown:"if (reason == 0) panic();"
>
> (Where the 'reason' field of the NMI event is the current legacy 'reason' value
> there.)
>
> The filter code would have to be modified to be able to recognize the panic()
> bit, but that's desirable anyway and it is a one-time effort.
>
> This:
>
> events=nmi::unknown:"if (reason == 0) ignore();"
>
> would be a possible outcome as well, on certain boxes - to skip certain events.

We can already determine in the kernel whether an NMI is unknown. If you
want to push all unknown-NMI policy into user space (although I don't think
that is the best solution), is it not sufficient to just check the system in
user space (via PCI ID, DMI ID, etc.) and set the existing
"unknown_nmi_panic" accordingly?
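
As a rough sketch of such a user-space check (an illustration, not a tested
tool): the DMI strings under /sys/class/dmi/id and the
/proc/sys/kernel/unknown_nmi_panic sysctl are existing interfaces, while the
vendor/product table and the function names below are made up for the
example. PCI ID matching is left out.

```python
from pathlib import Path

# Existing kernel interfaces: DMI strings in sysfs and the sysctl in procfs.
DMI_DIR = Path("/sys/class/dmi/id")
SYSCTL = Path("/proc/sys/kernel/unknown_nmi_panic")

# HYPOTHETICAL table of (sys_vendor, product_name) pairs for machines on
# which an unknown NMI should be treated as a fatal hardware error.
PANIC_ON_UNKNOWN_NMI = {
    ("Example Vendor", "Example Server 1000"),
}


def read_dmi(field, dmi_dir=DMI_DIR):
    """Return one DMI string, or "" if the file is missing or unreadable."""
    try:
        return (Path(dmi_dir) / field).read_text().strip()
    except OSError:
        return ""


def wants_panic(dmi_dir=DMI_DIR, table=PANIC_ON_UNKNOWN_NMI):
    """True if this machine's vendor/product pair is listed in the table."""
    key = (read_dmi("sys_vendor", dmi_dir), read_dmi("product_name", dmi_dir))
    return key in table


def apply_policy(dmi_dir=DMI_DIR, sysctl=SYSCTL):
    """Enable unknown_nmi_panic on listed machines; leave others untouched."""
    if not wants_panic(dmi_dir):
        return False
    Path(sysctl).write_text("1\n")  # needs root on a real system
    return True
```

Calling apply_policy() as root from an init script or from the RAS daemon
would be enough; machines that are not in the table keep whatever value was
set on the command line or via procfs.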

Best Regards,
Huang Ying