Re: NMI error and Intel S5000PSL Motherboards

From: Randy Dunlap
Date: Tue Sep 25 2007 - 23:00:03 EST


On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

> We have about 100 servers based on Intel S5000PSL-SATA motherboards.

product info (for others):
http://support.intel.com/support/motherboards/server/s5000psl/index.htm

> They have been running for anywhere between 1 and 10 months. For the
> past few months, after updating them all to the 2.6.20.15 kernel
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> errors. For example:
>
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>
> Sometimes these errors cause a total system freeze. Most of the time the
> systems keep running.
>
> We have determined these errors come most frequently on machines that
> have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE
> these cards (it doesn't matter what slot they are in), the NMI errors
> can occur as frequently as every 3-5 minutes. On machines that do NOT
> have these Quad Port Adapters, the NMI errors occur about once per month
> on average. (we have tried the "in-kernel" e1000 drivers, as well as
> Intel's latest - 7.6.5).
>
> We have also determined (through a chance discovery) that running
> "scanpci" can 100 percent reliably reproduce the NMI error on any
> machine that has the Quad Port NICS. Our various motherboards have
> different Intel BIOS versions - some have Rev 70, others 74, 79 or 81.
> They all exhibit the same behavior regardless of BIOS version.
>
> We have reproduced this problem with:
>
> Mandriva 2008 RC2 (2.6.22 kernel)
> Mandriva 2007 with custom 2.6.20.15 kernel
> Mandriva 2007 with custom 2.6.19.8 kernel
> Ubuntu "Feisty" with 2.6.20 kernel
> Fedora Core 7 with 2.6.22 kernel
>
> The problem does NOT occur with any distribution running a 2.6.18 kernel
> or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included
> 2.6.17 kernel or custom-compiled 2.6.18 kernel.
>
> We have been in contact with Intel. Their high level tech support people
> have basically said,
>
> "the errors we have logged so far are pointing to a kernel issue and
> not a hardware problem. If we [Intel] can confirm this, it will be
> up to the kernel developer or OS system manufacturer to debug those
> ones, as we do not perform Operating system support."
>
> In other words, Intel seems to be blaming the problem we are seeing on
> something introduced starting with the 2.6.19 kernel. We are not looking
> to blame anybody. We are only looking for a solution.
>
> Does anybody have an idea what could be going on here, as well as what
> the solution may be? Going back to 2.6.18 or lower is not an option.


Please provide some basic info, like:

- how much RAM
- what CPUs (be precise: use 'cat /proc/cpuinfo')
- output of 'lspci -v'
- what kind(s) of SATA drives
- are you using 32-bit or 64-bit kernel(s)
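
Something along these lines would gather most of that in one file.
This is only a sketch: the output file name (sysinfo.txt) and the
run() helper are made up for illustration, and /proc/scsi/scsi is
assumed to be available for listing the drives (it depends on the
kernel config).

```shell
#!/bin/sh
# Collect the basic system info asked for above into one file.
# OUT is an arbitrary file name chosen for this sketch.
OUT=${OUT:-sysinfo.txt}

# Run a command if it is installed; otherwise just note that it is missing.
run() {
    echo "### $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not found)"
    fi
    echo
}

{
    run uname -srm                    # kernel release; machine type shows 32- vs 64-bit
    run grep MemTotal /proc/meminfo   # installed RAM
    run cat /proc/cpuinfo             # exact CPU model(s)
    run lspci -v                      # PCI devices, including the quad-port NICs
    run cat /proc/scsi/scsi           # drives seen by the SCSI layer (assumption: file exists)
} > "$OUT"
```

Run it as root if possible, since 'lspci -v' shows less detail
for ordinary users.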

Can you test kernels from kernel.org (i.e., not vendor kernels,
no other [unknown] patches applied to them)?

Does tracing 'scanpci' produce any helpful information?
# strace -o scanpci.trace scanpci
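
If the trace is long, a filter like the one below may help narrow it
down. It is a rough sketch only: the pci_calls name is made up, and
the pattern list is my guess at which calls matter (I/O privilege
requests and PCI-related files), not a definitive set. Note that
strace only shows system calls, so direct port I/O done after
iopl()/ioperm() will not appear in the trace itself.

```shell
#!/bin/sh
# List the lines of an strace log that look PCI- or raw-I/O-related.
# The pattern list is a guess at what is relevant, not a definitive set.
pci_calls() {
    grep -nE 'iopl|ioperm|/proc/bus/pci|/sys/bus/pci|/dev/mem' "${1:-scanpci.trace}"
}

# Only run when the trace from the strace command above exists.
if [ -f scanpci.trace ]; then
    pci_calls scanpci.trace
    echo "--- last calls before the trace ends:"
    tail -n 20 scanpci.trace
fi
```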


---
~Randy
Phaedrus says that Quality is about caring.