Re: NMI watchdog + NOHZ question

From: Andi Kleen
Date: Wed Jun 24 2009 - 06:52:34 EST


On Wed, Jun 24, 2009 at 03:32:33AM -0700, David Miller wrote:
> From: Andi Kleen <andi@xxxxxxxxxxxxxx>
> Date: Wed, 24 Jun 2009 12:23:25 +0200
>
> >> And similarly to sparc64, if that 5+ second qla2xxx interrupt
> >> sequence happens after the tick_nohz_stop_sched_tick() call
> >> we can run into the same situation.
> >
> > Yes it would be probably safer to do the tick disabling with
> > interrupts off already.
>
> That only makes sense if you're really putting the cpu to sleep
> until an interrupt or similar happens.

That is what the idle loop is supposed to do, isn't it?

> > These days NMI watchdog is not used much on x86 anymore because it's
> > default off, so probably people never noticed that.
>
> I really didn't want to provide the feature that way on sparc64 which
> is why I made it on by default. It would be interesting to reconsider
> x86's default, perhaps even only on a trial basis in -next.

The reason it was turned off is that a few systems (e.g. laptops
from a particular vendor) don't handle NMIs correctly in the
platform: when an NMI arrives while an SMI is active, they hang.
There were also a few other strange problems on other systems
that went away when it was disabled.

One way to handle all that would be to have a big NMI white/black
list for specific systems. That would be worthwhile because there
are a few cases where NMIs are really useful: one example right now
is panic, which currently cannot stop other CPUs that have
interrupts disabled.

But creating and maintaining such a list would be a lot of
work (at least initially), and so far nobody was interested
enough to do that.
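
To make the white/black list idea concrete, here is a rough sketch of
what a table-driven lookup could look like. It is plain userspace C,
not kernel code; in the kernel this would presumably key off DMI data
(dmi_check_system() and struct dmi_system_id exist for that kind of
matching). The struct nmi_quirk type, the nmi_watchdog_allowed()
function and the "ExampleVendor" entries are all made up for
illustration and do not name any real system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list entry; vendor/product are substrings to match. */
struct nmi_quirk {
	const char *vendor;	/* must appear in the system vendor string */
	const char *product;	/* must appear in the product name, NULL = any */
	bool nmi_ok;		/* true: whitelisted, false: blacklisted */
};

/* Placeholder entries only. The first matching entry wins. */
static const struct nmi_quirk nmi_quirks[] = {
	{ "ExampleVendor", "Laptop", false },	/* hangs on NMI during SMI */
	{ "ExampleVendor", NULL,     true  },	/* rest of the vendor is fine */
};

/* Return the list verdict, or dflt if the system is not listed. */
static bool nmi_watchdog_allowed(const char *vendor, const char *product,
				 bool dflt)
{
	size_t i;

	for (i = 0; i < sizeof(nmi_quirks) / sizeof(nmi_quirks[0]); i++) {
		const struct nmi_quirk *q = &nmi_quirks[i];

		if (!strstr(vendor, q->vendor))
			continue;
		if (q->product && !strstr(product, q->product))
			continue;
		return q->nmi_ok;
	}
	return dflt;
}
```

The dflt argument is where the "on or off by default" question from
this thread would end up: unlisted systems simply get whatever the
caller passes in. The code is the easy part; the work is in
collecting and maintaining the entries.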

When you don't have as many different platforms and vendors,
things are a lot easier.

>
> It's so useful, and in the short time sparc64 has had this NMI code I
> can count at least 8 bugs I've fixed only because it was on all the
> time.

Yes, when it was still on it also found bugs. On the other hand, once
it is on by default the number of new bugs you find with it goes
down quite fast.

-Andi

--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.