Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

From: huang ying
Date: Thu Sep 23 2010 - 05:30:04 EST


Hi, Don,

On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
>>
>> >
>> > I guess adding either another knob to override the hardware error option
>> > or tying it in with the panic_on_unknown_error option might make me more
>> > comfortable. That way enterprise customers can always just enable it by
>> > default and desktop users (for now) could have it off.
>>
>> Anything that needs explicit enabling is a bad idea, that
>> would lead to a lot of users running in "corrupt my data" mode.
>
> I know. But as I said earlier in my emails, I am trying to figure out how
> to deal with the fallout from unknown nmis turning into panics. Today
> people see unknown nmis. They may or may not be corrupting data. They
> usually file a bug. Currently it is hard for me to diagnose the problem.
> Usually the old 'upgrade your bios/firmware' does the trick. Sometimes it
> doesn't and people feel like their machines still run fine. So they
> ignore it (for good or for bad).
>
> Turning unknown nmis into panics would break their current setup without
> much gain. So I was trying to propose something temporarily until we
> could get a better infrastructure to isolate the source and provide better
> info on what to do.
>
> I agree with you that long term unknown nmis should be panics. I just get
> nervous about doing that now from a support perspective.

In fact, we use a whitelist policy here. Only systems that have HEST, or
that are identified by their chipset host bridge PCI ID, will panic on an
unknown NMI. So I think the systems you are worried about will not have
this enabled.
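
To make the policy concrete, here is a rough user-space sketch in C. It is
not the code from the patch set: the struct, the function names and the PCI
IDs in the table are all hypothetical placeholders, chosen only to show the
shape of the check (panic if HEST is present, or if the host bridge is on
the whitelist).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform description; not a real kernel structure. */
struct platform_info {
	bool has_hest;       /* firmware exposes an ACPI HEST table */
	uint16_t hb_vendor;  /* PCI vendor ID of the host bridge */
	uint16_t hb_device;  /* PCI device ID of the host bridge */
};

struct hb_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder whitelist; these IDs are made up for illustration. */
static const struct hb_id hb_whitelist[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

/* Return true if the host bridge PCI ID is on the whitelist. */
bool host_bridge_whitelisted(const struct platform_info *p)
{
	size_t n = sizeof(hb_whitelist) / sizeof(hb_whitelist[0]);

	for (size_t i = 0; i < n; i++) {
		if (hb_whitelist[i].vendor == p->hb_vendor &&
		    hb_whitelist[i].device == p->hb_device)
			return true;
	}
	return false;
}

/* Whitelist policy: panic on an unknown NMI only on known platforms. */
bool unknown_nmi_should_panic(const struct platform_info *p)
{
	return p->has_hest || host_bridge_whitelisted(p);
}
```

With this shape, a machine that has no HEST and whose host bridge is not
listed keeps today's behaviour for unknown NMIs.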

>> The code currently uses the presence of a HEST error table
>> to detect a server. HEST should be only available on servers.
>>
>> On servers at least panic should be default.
>
> Ok. That's fine. But then what. What does a developer do with that
> panic? There's no useful info. That is sorta my problem. Then again I
> do not know much about HEST.

On some systems, there is a hardware error log in the BMC/BIOS. The
hardware error log can be retrieved via IPMI or the BIOS menu. Otherwise,
can we get some useful info for an unknown NMI? If we can, can we collect
the info, then print it on the console and save it to flash via ERST (also
part of APEI) before the panic?

HEST is defined in the ACPI specification, version 4.0 and later, in the
section named APEI (ACPI Platform Error Interface). It describes the error
sources of the system. It should be available only on server platforms.

Best Regards,
Huang Ying