Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: Cyrill Gorcunov
Date: Mon May 16 2011 - 15:53:10 EST


On 05/16/2011 11:03 PM, Don Zickus wrote:
> On Mon, May 16, 2011 at 09:09:45AM +0800, Huang Ying wrote:
>>> Ying, the concern is rather about the code scheme in general. Since
>>> we have notifiers, I think the better way is to be consistent here and
>>> use a hwerr notifier too. But that's just IMHO ;)
>>
>> As for whether to use notifiers or not, IMHO a rule can be:
>>
>> - If it is something like a driver, then it should go through a notifier
>> - If it is an architectural/PC de facto standard, it can sit outside of
>> the notifier.
>
> Hmm, then what do you do about perf? That is architectural and a de facto
> standard, but I am not sure hardcoding that would be appropriate.

Good point!

>
>>
>> I think that treating an unknown NMI as a hardware error should be part
>> of the PC de facto standard. Do you think so?
>
> Well, after thinking about it, I would say no. And my reason is, if
> vendors are really serious about using NMIs as an indicator for hardware
> errors, shouldn't they be setting a bit in the memory controller/north
> bridge or south bridge/IOHC for an NMI handler to read? I mean hardware

The UV platform has such a bit, iirc :)

> devices don't just get wired directly to the NMI pin on the cpu, right?
> They generally have to go through some hub that acts as a multiplexer.
>
> In those cases, why can't those hubs set a bit saying it detected an error
> (don't PCIe bridges already do that?) and let the NMI handler read it to
> confirm. This way we can leave 'unknown NMIs' as a way to say an
> unclaimed NMI entered the system and we can have users set policy about
> what to do, panic, printk, whatever.
>
> But for the HEST stuff, it should be smart enough by now to trap any
> hardware error, no? How does a machine that supports HEST let a hardware
> error get through without detecting it? Isn't that the point? Detect a
> hardware error, grab as much info about it as possible, save the error
> record and then panic?
>
> Otherwise if you just panic, then you have no idea why the machine errored
> in the first place. It might be the safe thing to do in some
> circumstances, but then you have to wonder why the fancy HEST enabled
> server didn't catch it. Isn't that why people are spending extra money
> on those Intel servers with RAS features?
>
> Cheers,
> Don
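
To make the scheme Don describes above a bit more concrete, here is a rough
user-space sketch, not kernel code. Every name in it (struct hub,
HUB_ERR_BIT, dispatch_nmi, the policy enum) is made up for illustration, and
the status field only stands in for a real hardware register. The idea is
that each hub latches an error bit, a handler claims the NMI only after
reading that bit, and anything left unclaimed falls through to a
user-selected policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bit in a hub's status register. */
#define HUB_ERR_BIT 0x1u

/* What the dispatcher decided to do with one NMI. */
enum nmi_result { NMI_CLAIMED, NMI_UNKNOWN_LOGGED, NMI_UNKNOWN_PANIC };

/* User-selected policy for NMIs that nobody claims. */
enum unknown_nmi_policy { POLICY_LOG, POLICY_PANIC };

/* Stand-in for a bridge, memory controller or similar multiplexing hub. */
struct hub {
	const char *name;
	uint32_t status;	/* simulated hardware status register */
};

/* A hub claims the NMI only if it latched an error; claiming acks the bit. */
static bool hub_claims_nmi(struct hub *h)
{
	if (!(h->status & HUB_ERR_BIT))
		return false;
	h->status &= ~HUB_ERR_BIT;
	return true;
}

/* Ask every hub in turn; fall back to the policy if the NMI is unclaimed. */
static enum nmi_result dispatch_nmi(struct hub *hubs, size_t n,
				    enum unknown_nmi_policy policy)
{
	assert(hubs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (hub_claims_nmi(&hubs[i]))
			return NMI_CLAIMED;

	return policy == POLICY_PANIC ? NMI_UNKNOWN_PANIC : NMI_UNKNOWN_LOGGED;
}
```

With this split, "unknown NMI" keeps meaning only "nobody claimed it", and
whether that ends in a panic or a log line stays a policy decision rather
than something hardcoded.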

--
Cyrill