RE: MCE problem on dual Opteron

From: Martin Drab
Date: Thu Aug 04 2005 - 10:16:55 EST


On Thu, 4 Aug 2005, Roger Heflin wrote:

> If this does not happen immediately at boot up (before the machine
> finished all init stuff), it is generally a hardware problem. In
> my experience with new machines 75% of the time it will be the cpu
> itself, and another 25% it will be a serious memory error.

Unfortunatelly this seems to appear during bootup. Allways at the
exact same place. Attached is a dmesg output that precedes the below
mentioned MCE error. (BTW: The timestamp there isnt actually 847.7... but
84.7... I got it wrong when I was copying it from the screen by hand.) The
attached dmesg output is from the 'nomce' run of the kernel and it is
snipped so that it ends just where it ends without the 'nomce' option and
then follows the below mentioned MCE error.

I don't think it is a HW problem, or am I wrong?

Martin

>
> > -----Original Message-----
> > From: linux-kernel-owner@xxxxxxxxxxxxxxx
> > [mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of Martin Drab
> > Sent: Thursday, August 04, 2005 8:55 AM
> > To: Linux Kernel Mailing List
> > Subject: MCE problem on dual Opteron
> >
> > Hi,
> >
> > I get the following problem with 2.6.13-rc5-git1 on a dual Opteron
> > machine:
> >
> > ---------
> > ...
> > [ 847.745921] CPU 0: Machine Check Exception:
> > 7 Bank 3: b40000000000083b
> > [ 847.746066] RIP 10:<ffffffff802c04ee> {pci_conf1_read+0xbe/0x110}
> > [ 847.746149] TSC 189fe311d3f ADDR fdfc000cfe
> > [ 847.746218] Kernel panic - not syncing: Uncorrected machine check
> > ---------
> >
> > This appears during bootup and it hangs. So my question is:
> > Is this a HW problem or is it some kernel (MCE ?) bug? If it
> > is a HW problem is it possible to determine what's wrong somehow?
...[ 0.000000] Bootdata ok (command line is ro root=LABEL=/ resume=/dev/sda2 rhgb vga=794 ipmi_watchdog.start_now=0 ipmi_watchdog.timeout=0 nomce)
[ 0.000000] Linux version 2.6.13-rc5-git1 (root@xxxxxxxxxxxxxxxxxxxx) (gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)) #2 SMP Wed Aug 3 19:41:13 CEST 2005
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009b000 (usable)
[ 0.000000] BIOS-e820: 000000000009b000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000d0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
[ 0.000000] BIOS-e820: 000000007ff70000 - 000000007ff7b000 (ACPI data)
[ 0.000000] BIOS-e820: 000000007ff7b000 - 000000007ff80000 (ACPI NVS)
[ 0.000000] BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] ACPI: RSDP (v002 PTLTD ) @ 0x00000000000f78e0
[ 0.000000] ACPI: XSDT (v001 PTLTD XSDT 0x06040000 LTP 0x00000000) @ 0x000000007ff78d11
[ 0.000000] ACPI: FADT (v003 AMD HAMMER 0x06040000 PTEC 0x000f4240) @ 0x000000007ff7ae0e
[ 0.000000] ACPI: HPET (v001 AMD HAMMER 0x06040000 PTEC 0x00000000) @ 0x000000007ff7af02
[ 0.000000] ACPI: MADT (v001 PTLTD APIC 0x06040000 LTP 0x00000000) @ 0x000000007ff7af3a
[ 0.000000] ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x06040000 PTL 0x00000001) @ 0x000000007ff7afb0
[ 0.000000] ACPI: DSDT (v001 AMD-K8 AMDACPI 0x06040000 MSFT 0x0100000e) @ 0x0000000000000000
[ 0.000000] Scanning NUMA topology in Northbridge 24
[ 0.000000] Number of nodes 2
[ 0.000000] Node 0 using interleaving mode 1/0
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at 0000000000000000-000000007ff70000
[ 0.000000] Bootmem setup node 0 0000000000000000-000000007ff70000
[ 0.000000] On node 0 totalpages: 524144
[ 0.000000] DMA zone: 4096 pages, LIFO batch:1
[ 0.000000] Normal zone: 520048 pages, LIFO batch:31
[ 0.000000] HighMem zone: 0 pages, LIFO batch:1
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[ 0.000000] Processor #0 15:5 APIC version 16
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[ 0.000000] Processor #1 15:5 APIC version 16
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[ 0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: IOAPIC (id[0x03] address[0xfc000000] gsi_base[24])
[ 0.000000] IOAPIC[1]: apic_id 3, version 17, address 0xfc000000, GSI 24-27
[ 0.000000] ACPI: IOAPIC (id[0x04] address[0xfc001000] gsi_base[28])
[ 0.000000] IOAPIC[2]: apic_id 4, version 17, address 0xfc001000, GSI 28-31
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
[ 0.000000] ACPI: IRQ0 used by override.
[ 0.000000] ACPI: IRQ2 used by override.
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] Setting APIC routing to flat
[ 0.000000] ACPI: HPET id: 0x102282a0 base: 0xfed00000
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] Allocating PCI resources starting at 80000000 (gap: 80000000:7ec00000)
[ 0.000000] Built 1 zonelists
[ 0.000000] Kernel command line: ro root=LABEL=/ resume=/dev/sda2 rhgb vga=794 ipmi_watchdog.start_now=0 ipmi_watchdog.timeout=0 nomce
[ 0.000000] Initializing CPU#0
[ 0.000000] PID hash table entries: 4096 (order: 12, 131072 bytes)
[ 0.000000] time.c: Using 14.318180 MHz HPET timer.
[ 0.000000] time.c: Detected 1994.064 MHz processor.
[ 94.964587] Console: colour dummy device 80x25
[ 94.967418] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 94.971633] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 94.995579] Memory: 2055484k/2096576k available (2263k kernel code, 0k reserved, 1473k data, 220k init)
[ 95.144209] Calibrating delay using timer specific routine.. 3992.44 BogoMIPS (lpj=19962218)
[ 95.144251] Security Framework v1.0.0 initialized
[ 95.144263] SELinux: Initializing.
[ 95.144279] SELinux: Starting in permissive mode
[ 95.144286] selinux_register_security: Registering secondary module capability
[ 95.144291] Capability LSM initialized as secondary
[ 95.144305] Mount-cache hash table entries: 256
[ 95.144456] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[ 95.144461] CPU: L2 Cache: 1024K (64 bytes/line)
[ 95.144465] CPU 0(1) -> Node 0 -> Core 0
[ 95.144470] mtrr: v2.0 (20020519)
[ 95.254026] Using local APIC timer interrupts.
[ 95.304130] Detected 12.462 MHz APIC timer.
[ 95.304237] Booting processor 1/2 APIC 0x1
[ 95.314544] Initializing CPU#1
[ 95.463830] Calibrating delay using timer specific routine.. 3987.63 BogoMIPS (lpj=19938185)
[ 95.463838] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[ 95.463840] CPU: L2 Cache: 1024K (64 bytes/line)
[ 95.463842] CPU 1(1) -> Node 0 -> Core 0
[ 95.463968] AMD Opteron(tm) Processor 246 stepping 0a
[ 95.463979] CPU 1: Syncing TSC to CPU 0.
[ 95.464001] Brought up 2 CPUs
[ 95.464240] CPU 1: synchronized TSC with CPU 0 (last diff -23 cycles, maxerr 1124 cycles)
[ 95.464277] time.c: Using HPET based timekeeping.
[ 95.464289] testing NMI watchdog ... OK.
[ 95.564386] checking if image is initramfs... it is
[ 95.647764] NET: Registered protocol family 16
[ 95.647808] ACPI: bus type pci registered