Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error

From: Huang Ying
Date: Thu Oct 21 2010 - 22:05:21 EST


On Fri, 2010-10-22 at 09:49 +0800, Don Zickus wrote:
> On Thu, Oct 21, 2010 at 05:45:44PM +0200, Andi Kleen wrote:
> > On Thu, Oct 21, 2010 at 10:10:02AM -0400, Don Zickus wrote:
> > > On Thu, Oct 21, 2010 at 01:17:31PM +0800, Huang Ying wrote:
> > > > > > But there are some general rules for unknown NMIs. We think an unknown
> > > > > > NMI is a hardware error notification on all systems except those with
> > > > > > broken hardware, software bugs, or stone-age machines. Do you agree?
> > > > >
> > > > > Nope. In my experience, most of our customers are still running
> > > > > pre-Nehalem boxes, therefore most unknown NMIs are from broken hardware or
> > > > > bad firmware (at least in the bugzillas I deal with).
> > > >
> > > > It seems that we have different points of view on the cause of unknown
> > > > NMIs. Should broken hardware be seen as a hardware error?
> > >
> > > Well, do you have an alternative way to handle broken hardware? Broken
> > > hardware has generated NMIs, sometimes if I am lucky SERRs. The ones that
> > > generate SERRs can be filtered through a different path, but what about
> > > the ones that don't?
> > >
> >
> > Don, AFAIK you're saying the same thing as Ying: an unknown NMI is
> > a hardware error.
> >
> > The reason the hardware does that is that it wants to tell us:
> >
> > "I lost track of an error. There is corrupted data somewhere in the system.
> > Please stop, don't do anything that could consume that data. S.O.S."
> >
> > The correct answer for that is panic.
>
> After re-reading Huang's patch, I am starting to understand what you mean
> by broken hardware. Basically you are trying to distinguish between
> legacy systems that were 'broken' in the sense they would randomly send
> unknown NMIs for no good reason, hence the 'Dazed and confused' messages,
> and hardware errors on more modern systems that say, 'Hardware error,
> panicking, check your BIOS for more info' (or whatever).

Yes.

> So Huang's patch was sort of acting like a switch. On legacy systems use
> 'Dazed and confused' for unknown NMIs. Whereas on whitelisted modern
> systems use a more relevant 'Check BIOS for error' message. Is that
> right?

In fact, we want to panic and print 'check BIOS for error, contact your
hardware vendor' on all systems. But as you said, some 'broken hardware'
randomly sends unknown NMIs for no good reason, so a white list is used
for them. And in fact, not all pre-Nehalem machines are 'broken'.
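
To make the switch concrete, here is a minimal user-space sketch of the
policy as Don summarized it above: whitelisted systems treat an unknown
NMI as a hardware error and panic, everything else keeps the legacy
'Dazed and confused' path. This is an illustration only, not code from
the patch. The enum, the function names, and the platform strings are
all hypothetical, and the real patch may key the white list on something
other than a name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of handling an unknown NMI. */
enum nmi_action {
	NMI_DAZED_AND_CONFUSED,	/* legacy path: log the message and continue */
	NMI_PANIC_HWERR		/* treat as hardware error: panic */
};

/* Hypothetical white list; the entries are placeholders, not real systems. */
static const char *const hwerr_whitelist[] = {
	"example-platform-a",
	"example-platform-b",
};

/* Return true if the given platform name appears on the white list. */
static bool platform_whitelisted(const char *platform)
{
	size_t i;

	for (i = 0; i < sizeof(hwerr_whitelist) / sizeof(hwerr_whitelist[0]); i++)
		if (strcmp(platform, hwerr_whitelist[i]) == 0)
			return true;
	return false;
}

/* Choose the unknown-NMI policy for a platform (assumed semantics). */
static enum nmi_action classify_unknown_nmi(const char *platform)
{
	return platform_whitelisted(platform) ? NMI_PANIC_HWERR
					      : NMI_DAZED_AND_CONFUSED;
}
```

With this shape, a system that is not on the list keeps today's
behavior, so extending the list is the only way to change policy for a
machine.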

> That's why you guys are complaining that registering a die_notifier would
> be silly?

I think whether to use a die_notifier or unknown_nmi_error() depends on
whether the handling is general or specific to some hardware. Do you
agree with that?

Best Regards,
Huang Ying

