increasing APIC errors & spontaneous reboots

From: Steve Snyder
Date: Sun Nov 09 2003 - 11:19:25 EST


Hello.

I'm seeing an increasing number of APIC errors (variable irq_mis_count,
displayed as the "MIS:" value in /proc/interrupts) and I'm also seeing
spontaneous reboots. Could these be related?

Before I go into details, a little background. I've been using this same
motherboard/CPUs/RAM with the 2.4.x kernels for several years with no
problems. Oh, the MIS: count would climb to 2, maybe as high as 4 but as
no problems were experienced I never much cared. The hardware being used
is a Giga-Byte GA-6BXDU motherboard, dual P3/850Mhz processors and 3
256MB sticks of ECC PC100 RAM.

For the last 6 - 8 months I've been seeing the MIS count increasing. The
largest I've seen is a value of 20. During the same period I've been
seeing spontaneous reboots. The system will reboot itself every 3 to 4
weeks. There's no great uptime here; those APIC errors accumulate in
less than a month.

No problems are seen in the system logs for this machine. Things are
apparently going along fine, then the next log entry is the starting line
of a kernel loading. The system has rebooted. (I'm administering this
machine remotely, so I can't report any text shown on the screen.)

I initially suspected the problems were heat-related (no, I do not
overclock anything) but lm_sensors consistently shows the CPU fan
rotations and temperatures to be of reasonable values.

Although this generally is a Red Hat v7.3 system, I run the latest stable
plain-vanilla Linux kernel, not RH's patched versions. I am currently
running v2.4.22, with v2.4.21 before that.

Can anyone explain to me what would cause the increasing APIC MIS counts?
Also, could this cause the system to reboot?

Thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/