Re: nmi_watchdog=2 regression in 2.6.21

From: Stephane Eranian
Date: Tue Aug 28 2007 - 13:06:29 EST


Daniel,

On Tue, Aug 28, 2007 at 07:34:44AM -0700, Daniel Walker wrote:
> On Tue, 2007-08-28 at 02:12 -0700, Stephane Eranian wrote:
> > Daniel,
> >
> > On Mon, Aug 27, 2007 at 04:07:54PM -0700, Daniel Walker wrote:
> > > On Mon, 2007-08-27 at 15:55 -0700, Stephane Eranian wrote:
> > >
> > > > Yet the model name looks strange. So we need to run one more test,
> > > > as the fam/model is not enough. What we need to check is whether or
> > > > not this processor implements architectural perfmon or not.
> > > >
> > > > Could you please compile and run the attached program and send me
> > > > the output?
> > >
> > > The output below is all the output ..
> > >
> > > eax=0x7280201: version=1 num_cnt=2
> > >
> > Then you have a Core Duo processor and the commit from Bjorn should
> > fix the problem. If it does not, then there is something else wrong.
> > Unfortunately, I do not have a Core Duo machine to try and reproduce.
>
> There must be something else wrong, cause the problem persists .. As I
> said in past emails to Bjorn, I tested his commit in git, as well as the
> latest git all with the same issue (as well as bisecting git)..
>
> If the hardware is buggy then we need some way to determine that..
>
Could you instrument check_nmi_watchdog() to verify that you terminate
this function? Normally there is a safety mechanism in there.

Another possibility is that you get flooded with NMI interrupts and
do not make forward progress.

> If this machine didn't support performance counters, what would happen
> then?
>

If you have an Local APIC and performance counters, then it will try and use it.
Otherwise, I suspect it tries the NMI_IO_APIC (nmi_watchdog=1).

--
-Stephane
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/