Re: [PATCH 4/4] watchdog: configure nmi watchdog period based on watchdog_thresh

From: Ingo Molnar
Date: Tue May 17 2011 - 03:16:59 EST



* Mandeep Singh Baines <msb@xxxxxxxxxxxx> wrote:

> Before the conversion of the NMI watchdog to perf event, the watchdog
> timeout was 5 seconds. Now it is 60 seconds. For my particular application,
> netbooks, 5 seconds was a better timeout. With a short timeout, we
> catch faults earlier and are able to send back a panic. With a 60 second
> timeout, the user is unlikely to wait and will instead hit the power
> button, causing us to lose the panic info.

That's an interesting observation. Have you been able to measure/observe this
effect somehow, or do you presume that users find 60 seconds too long?

This would be a concern for upstream as well, I guess.

> This change configures the NMI period based on the watchdog_thresh.

Hm, our tolerance for the two thresholds is not just human but technical: hard
lockup warnings should indeed be triggered after just a few seconds, whereas
soft lockups can have false positives under extreme conditions.

So we generally want a higher threshold for soft lockups than for hard lockups.

So how about we couple the thresholds with a factor: we make the soft threshold
twice as long as the hard threshold? Then we could change the upstream default
as well, I think: let's change the NMI timeout to 10 seconds (and thus have the
soft threshold at 20 seconds). Is 20 seconds short enough for most users to not
hit reset?
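
Roughly, the coupling could look like the sketch below. It is a standalone
userspace illustration only: the names SOFT_TO_HARD_FACTOR,
DEFAULT_HARD_THRESH_SECS, soft_thresh_secs() and print_thresholds() are made up
for this example and are not existing kernel symbols, and the values are just
the ones proposed above.

```c
#include <stdio.h>

/* Hypothetical constants, taken from the proposal above. */
enum {
	SOFT_TO_HARD_FACTOR      = 2,	/* soft threshold = 2 x hard threshold */
	DEFAULT_HARD_THRESH_SECS = 10	/* proposed NMI (hard lockup) timeout */
};

/* Derive the soft-lockup threshold from the hard-lockup threshold. */
int soft_thresh_secs(int hard_thresh_secs)
{
	return hard_thresh_secs * SOFT_TO_HARD_FACTOR;
}

/* Print both thresholds for a given hard-lockup setting. */
void print_thresholds(int hard_thresh_secs)
{
	printf("hard lockup: %d s, soft lockup: %d s\n",
	       hard_thresh_secs, soft_thresh_secs(hard_thresh_secs));
}
```

With a single knob like this, a netbook setup that wants a 5 second NMI timeout
would get a 10 second soft threshold automatically, rather than having to tune
two values separately.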

We might want to change another aspect of the NMI watchdog: right now it tries
to abort the offending task - which is really nasty if there was a spuriously
long irqs-off section somewhere in the kernel. How about we just print a
warning instead?

Thanks,

Ingo