Re: [PATCH] Add mem_nmi_panic enable system to panic on hard error

From: Hidetoshi Seto
Date: Tue Dec 13 2005 - 03:56:22 EST


Thanks Dave,

Dave Jones wrote:
Hmm, are you sure this isn't a bios error misconfiguring
some northbridge register perhaps ? Some chipsets offer
such reporting as a feature. Could be your server has this
on by default.

I had PCI-clipping tests on our servers.
On injected error, I confirmed that some of them actually
asserts NMI with the reason bit, and logs PCI parity error
to its SEL. (And rests, some having old chipsets, also logs
to SEL but asserts NMI with no reason bits, aka unknown NMI.)
Yes, it's true that not all server support the NMI reporting.

(I believe the EDAC code has also triggered similar cases
on certain cards which is why it too disables this checking
by default).

I'm not sure but there could be a special card and card driver
that triggers such NMI but can handle/recover the error.
Also I'm not sure why linux had not have "nmi_panic" but only
"unknown_nmi_panic" that have no effects on reasoned NMI.
...Would someone let me know?

Why not make this automatic based on dmi strings, instead of
making the user guess that he needs to pass obscure command
line options?

The sysctl seems pointless too. If this is needed at all,
why would you ever want to turn it off ?

Frankly, this is a kind of port from RHEL3.
Maybe as you know, RHEL3 has "mem_nmi_panic" sysctl.
Of course it is useful for me. That's why the patch is here.

I agree that some server will require this on by default.
However this will not be work with oprofile, and I think this is
not the time to concrete NMI handling.
So now mem_nmi_panic I suggest is just duplicated one of existing
unknown_nmi_panic.

H.Seto

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/