Re: [RFC PATCH 0/4] watchdog: hpwdt: Fix NMI-related behaviour when CONFIG_HPWDT_NMI_DECODING is enabled

From: Jerry Hoemann
Date: Tue Jan 15 2019 - 22:52:42 EST


On Mon, Jan 14, 2019 at 07:36:13AM +0500, Ivan Mironov wrote:
> Hi,
>
> I found out that hpwdt alters NMI behaviour unexpectedly if compiled
> with enabled CONFIG_HPWDT_NMI_DECODING:
>
> * System starts to panic on any NMI with misleading message.

hpwdt doesn't start to panic on any NMI. It starts to panic on:

1) NMI_SERR associated with NMI
2) NMI_IO_CHECK associated with IO errors
3) NMI_UNKNOWN NMI unclaimed by all local handlers.

On Gen10 going forward we plan to restrict to just iLO
generated NMIs.
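
The policy above can be sketched as a small userspace model. This is not
the driver code: hpwdt_would_panic(), gen10_or_later and ilo_claims_nmi
are hypothetical names used only for illustration, and the enum merely
mirrors the names of the x86 NMI handler types. It also assumes the
planned Gen10 restriction applies regardless of NMI type.

```c
#include <stdbool.h>

/* Userspace stand-in; the names mirror the x86 NMI handler types. */
enum nmi_type { NMI_LOCAL, NMI_UNKNOWN, NMI_SERR, NMI_IO_CHECK };

/*
 * Hypothetical helper: would hpwdt treat this NMI as fatal?
 * gen10_or_later and ilo_claims_nmi are assumptions made for this sketch.
 */
static bool hpwdt_would_panic(enum nmi_type type, bool gen10_or_later,
			      bool ilo_claims_nmi)
{
	/* Planned Gen10+ behaviour: only iLO-generated NMIs are fatal. */
	if (gen10_or_later)
		return ilo_claims_nmi;

	/* Legacy behaviour: the three NMI types listed above are fatal. */
	switch (type) {
	case NMI_SERR:
	case NMI_IO_CHECK:
	case NMI_UNKNOWN:
		return true;
	default:
		return false;
	}
}
```

In this model a legacy platform treats an NMI_UNKNOWN as fatal even when
iLO does not claim it, while a Gen10 platform ignores it unless iLO does.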

There is a long history on HP/HPE ProLiant systems where hpwdt
was the handler of general IO errors (at least the ones that would
cause an NMI to be generated), and we chose to panic in these
situations because the errors were generally quite serious.

Yes, this has caused some problems in the past as Linux has
overloaded NMI and some subsystems didn't claim the NMIs that
they generated (think profiling.) But, I haven't seen these
types of problems for several years now.

The more modern platforms have more robust error handling built
into them and into Linux, so going forward we'll restrict hpwdt to a
more traditional WDT role. But we're retaining the more conservative
approach for legacy platforms.

How would you suggest that the message be enhanced?


> * Watchdog provided by hpwdt is not working after such panic.
>
> Here are the patches that should fix this.
>
> This is an RFC patch series because I am not sure that patches are
> correct. Questions:
>
> * Are "mynmi" flags always set on all supported iLO versions when iLO
> is the source of NMI?


Unfortunately no.

hpwdt is a dual purpose driver. It handles the iLO watchdog timer
and the "Generate NMI to System" button. These are closely related
hardware wise.

However, some platforms generate an NMI for the "Generate NMI to System"
button without signaling it via the iLO registers. These show up as
NMI_UNKNOWN, which is why hpwdt still claims them.

There are also some systems that do not set the nmistat bits correctly.

So as not to break legacy platforms, the use of the nmistat bits for
control will be limited to Gen10 going forward.



> * Is it safe to reset "mynmi" flags to zero if code decides to not panic?

Reading the registers is itself destructive (it sets them to zero), but the
real issue is that some ProLiant systems lack the ability to acknowledge the
NMI, so only one can ever be received. So returning is not advisable, as no
further NMI will be generated via this path. A reset through firmware
is required to restore the feature.
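
A minimal userspace sketch of these two properties, a read-to-clear
status register and an NMI source that cannot be re-armed without an
acknowledge. struct nmistat_model, nmistat_read() and nmi_raise() are
hypothetical names for illustration and do not correspond to the real
iLO register layout or to anything in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the status register and the NMI source. */
struct nmistat_model {
	uint8_t bits;		/* pending status bits */
	bool nmi_armed;		/* false once the single NMI was delivered */
};

/* Reading returns the current bits and zeroes them (destructive read). */
static uint8_t nmistat_read(struct nmistat_model *r)
{
	uint8_t v = r->bits;

	r->bits = 0;
	return v;
}

/*
 * Try to deliver an NMI. With no way to acknowledge it, only the first
 * attempt succeeds; the source stays disarmed until a firmware reset.
 */
static bool nmi_raise(struct nmistat_model *r, uint8_t cause)
{
	if (!r->nmi_armed)
		return false;

	r->bits |= cause;
	r->nmi_armed = false;
	return true;
}
```

In this model a handler that reads the status and then returns has
consumed both the bits and the only NMI the platform will deliver.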


>
> Ivan Mironov (4):
> watchdog: hpwdt: Don't disable watchdog on NMI
> watchdog: hpwdt: Don't panic on foreign NMI
> watchdog: hpwdt: Add more information into message
> watchdog: hpwdt: Make panic behaviour configurable
>
> drivers/watchdog/hpwdt.c | 45 ++++++++++++++++++++++------------------
> 1 file changed, 25 insertions(+), 20 deletions(-)
>
> --
> 2.20.1

--

-----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett Packard Enterprise
-----------------------------------------------------------------------------