MCE problem on dual Opteron

From: Martin Drab
Date: Thu Aug 04 2005 - 08:58:30 EST


Hi,

I get the following problem with 2.6.13-rc5-git1 on a dual Opteron
machine:

---------
...
[ 847.745921] CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
[ 847.746066] RIP 10:<ffffffff802c04ee> {pci_conf1_read+0xbe/0x110}
[ 847.746149] TSC 189fe311d3f ADDR fdfc000cfe
[ 847.746218] Kernel panic - not syncing: Uncorrected machine check
---------

This appears during bootup, and the machine then hangs. So my question
is: is this a HW problem or some kernel (MCE?) bug? If it is a HW
problem, is it possible to determine what is wrong?
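
As an aid for reading the status word in the log above, here is a minimal
Python sketch that splits b40000000000083b into flag bits and an error
code. It assumes the generic x86 machine-check layout for MCi_STATUS
(VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63..57, MCA error code in
bits 15..0) and the standard bus-error code format 0000 1PPT RRRR IILL.
Any model-specific meaning of bank 3 on this CPU is not covered, and the
names decode_status and decode_bus_error are made up for this example.

```python
# Sketch: decode an x86 MCi_STATUS value (generic MCA layout assumed).

STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MCi_MISC holds valid data
    (58, "ADDRV"),  # MCi_ADDR holds a valid address
    (57, "PCC"),    # processor context corrupt
]

# Field encodings of the bus/interconnect error code 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "GEN"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
II = ["MEM", "reserved", "IO", "GEN"]
LL = ["L0", "L1", "L2", "LG"]


def decode_status(status):
    """Return (list of set flag names, 16-bit MCA error code)."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return flags, status & 0xFFFF


def decode_bus_error(code):
    """Decode a bus-error code; return None if the code has another format."""
    if (code & 0xF800) != 0x0800:
        return None
    rrrr = (code >> 4) & 0xF
    return {
        "participation": PP[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": RRRR[rrrr] if rrrr < len(RRRR) else "unknown",
        "memory_or_io": II[(code >> 2) & 0x3],
        "level": LL[code & 0x3],
    }


# Status value copied from the "Bank 3" line of the log.
flags, code = decode_status(0xB40000000000083B)
print(flags, hex(code), decode_bus_error(code))
```

Under these assumptions the value reads as valid, uncorrected, enabled,
with a valid ADDR, and the error code reads as a bus error on a data read
to I/O space. The low bits of ADDR (0xcfe) fall in the 0xCFC-0xCFF range
used as the PCI configuration mechanism #1 data port, which would fit the
pci_conf1_read RIP, but this alone does not say whether the hardware or
the kernel is at fault.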

I get the same output from 2.6.13-rc4-git4 and 2.6.12.3. When I run
the original FC4 kernel 2.6.11-1.1369_FC4smp, I get the same output
followed by this call trace:

---------
Call Trace: <#MC> <ffffffff80139195>{panic+133}
<ffffffff80115e1f>{print_mce+159} <ffffffff80115ed9>{mce_panic+137}
<ffffffff801165b4>{do_machine_check+852}
<ffffffff802e8f5e>{pci_conf1_read+190}
<ffffffff802e8f5e>{pci_conf1_read+190}
<ffffffff8010fe7f>{machine_check+127}
<ffffffff801f2c60>{selinux_d_instantiate+0}
<ffffffff802e8f5e>{pci_conf1_read+190} <EOE>
<ffffffff80541f97>{pci_direct_init+119} <ffffffff8010c232>{init+482}
<ffffffff8010f76b>{child_rip+8} <ffffffff8010c050>{init+0}
<ffffffff8010f763>{child_rip+0}
--------

Interestingly, FC4 adds the 'nomce' option to the kernel command line
by default (which makes me think that it may actually be a bug in the
kernel). With 'nomce' the system boots and runs quite normally.

Only recently, with 2.6.12.3 (which the box has been running for the
past few months), the box from time to time stops responding to the
outside world completely (so far it has happened 3 times in about a
month): no network, the display turns off (no signal), and the USB
keyboard and mouse both go dead. The machine is not powered off, though,
because for instance the disk LEDs keep flashing normally, but that may
be due to the intelligent LSI 1030 controller with its own independent
processor. There is nothing unusual in the kernel logs. The only thing
that may be related is that the IPMI server management card registers
the following 4 system events, which I cannot make much sense of:

---------
1)
SEL Entry Number = 5
SEL Record ID = 0050
SEL Record Type = 02 - System Event Record
Timestamp: 3.8.2005 02:31:59
Generator ID: 21 00
SEL Message Rev = 04
Sensor Type = 20 - OS Critical Stop
Sensor Number = 41 (unknown)
SEL Event Type = 6F - Sensor-specific, Assertion
SEL Event Data = A1 69 65
2)
SEL Entry Number = 6
SEL Record ID = 0060
SEL Record Type = 0F - OEM Defined
Timestamp:
Generator ID: 65 65
SEL Message Rev = 2C
Sensor Type = 20 - OS Critical Stop
Sensor Number = 6B - (unknown)
SEL Event Type = 69
SEL Event Data = 6C 6C 69
3)
SEL Entry Number = 7
SEL Record ID = 0070
SEL Record Type = 0F - OEM Defined
Timestamp:
Generator ID: 20 69
SEL Message Rev = 6E
Sensor Type = 74
Sensor Number = 65 - (unknown)
SEL Event Type = 72
SEL Event Data = 72 75 70
4)
SEL Entry Number = 8
SEL Record ID = 0080
SEL Record Type = 0F - OEM Defined
Timestamp:
Generator ID: 68 61
SEL Message Rev = 6E
Sensor Type = 64
Sensor Number = 6C - (unknown)
SEL Event Type = 65
SEL Event Data = 72 21 00
---------
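
Most field values in entries 2-4 above fall in the printable ASCII range,
so one way to look at them is as text rather than as IPMI fields. The
minimal Python sketch below does that, taking the bytes of each entry in
the order they are listed (Generator ID, SEL Message Rev, Sensor Type,
Sensor Number, SEL Event Type, SEL Event Data). Treating these records as
text is an assumption, and the names SEL_FIELDS and to_text are made up
for this example.

```python
# Sketch: show the raw field bytes of SEL entries 6-8 as ASCII text.

# Bytes copied from the listing above, in the order the fields appear.
SEL_FIELDS = {
    6: "65 65 2C 20 6B 69 6C 6C 69",
    7: "20 69 6E 74 65 72 72 75 70",
    8: "68 61 6E 64 6C 65 72 21 00",
}


def to_text(hex_bytes):
    """Map printable ASCII bytes to characters and anything else to '.'."""
    raw = bytes.fromhex(hex_bytes)  # fromhex() skips the spaces
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


for entry, fields in SEL_FIELDS.items():
    print(entry, repr(to_text(fields)))
```

The three entries come out as "ee, killi", " interrup" and "handler!"
(plus a trailing NUL byte). In entry 1, Sensor Number 41 is "A" and the
last two event data bytes 69 65 are "ie". The fragments look like pieces
of a kernel panic message such as "Aiee, killing interrupt handler!",
with the missing characters possibly sitting in fields that are not
printed (for instance the blank Timestamp), but that is a guess.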

Interestingly, however, while the timestamp in the above event log says
3.8.2005 02:31:59, /var/log/messages looks like this:

---------
Aug 3 02:25:01 neutron crond(pam_unix)[6257]: session opened for user root by (uid=0)
Aug 3 02:25:02 neutron crond(pam_unix)[6257]: session closed for user root
Aug 3 02:30:01 neutron crond(pam_unix)[6299]: session opened for user root by (uid=0)
Aug 3 02:30:01 neutron crond(pam_unix)[6300]: session opened for user root by (uid=0)
Aug 3 02:30:01 neutron crond(pam_unix)[6300]: session closed for user root
Aug 3 02:30:01 neutron crond(pam_unix)[6299]: session closed for user root
Aug 3 02:35:01 neutron crond(pam_unix)[6344]: session opened for user root by (uid=0)
Aug 3 02:35:01 neutron crond(pam_unix)[6344]: session closed for user root
...
Aug 3 04:01:02 neutron crond(pam_unix)[8132]: session closed for user root
Aug 3 04:02:01 neutron crond(pam_unix)[8171]: session opened for user root by (uid=0)
Aug 3 18:03:54 neutron syslogd 1.4.1: restart.
Aug 3 18:03:54 neutron kernel: klogd 1.4.1, log source = /proc/kmsg started.
---------

So basically there are logs up until 04:02:01 and then nothing for the
rest of the day (which is strange) until I hit the reset button at
18:03:54.
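
To put numbers on the timeline, a minimal Python sketch computing the two
gaps from the timestamps quoted above. It assumes the IPMI card and the
system clock use the same time zone and are roughly in sync, which is not
confirmed here. The variable names are made up for this example.

```python
# Sketch: gaps between the SEL event, the last log line and the restart.
from datetime import datetime, timedelta

sel_event = datetime.strptime("3.8.2005 02:31:59", "%d.%m.%Y %H:%M:%S")
last_log = datetime(2005, 8, 3, 4, 2, 1)    # last crond line before the gap
restart = datetime(2005, 8, 3, 18, 3, 54)   # syslogd restart after reset

sel_to_last_log = last_log - sel_event      # logging continued this long
silent_period = restart - last_log          # nothing logged in this window

print("SEL event to last log line:", sel_to_last_log)
print("last log line to restart:  ", silent_period)
```

If the clocks do agree, syslog kept writing for about an hour and a half
after the SEL event and then went silent for about fourteen hours.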

I would be very glad if anyone could tell me what's going on.

Thanks,
Martin

P.S.: The system is an MSI MS-9245 (AMD8131+8111) with 2xOpteron 246 and
2 GB of ECC memory.