MCE triggered with v3.1 and v3.2 on Xeon E5

From: Arnaud Lacombe
Date: Fri Mar 30 2012 - 12:15:19 EST


Hi,

I am having some trouble with Linux v3.1 and v3.2[0] on a machine
based on an E5-1650. I want to benchmark this platform with hackbench,
but both kernels crash with the following MCE:

[49922.326743] mce_notify_irq: 7 callbacks suppressed
[49922.331612] [Hardware Error]: Machine check events logged
[49922.354705] [Hardware Error]: Machine check events logged
[49954.962291] Disabling lock debugging due to kernel taint
[...]
[49955.606532] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
[49955.614797] [Hardware Error]: TSC 0 ADDR 2c0cb9c0 MISC 2048008086
[49955.621125] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100255 SOCKET 0 APIC 0
[49955.628603] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
[49955.636844] [Hardware Error]: TSC 0 ADDR 2c0ca0c0 MISC 214074f486
[49955.643163] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100257 SOCKET 0 APIC 0
[49955.650642] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
[49955.658857] [Hardware Error]: TSC 0 ADDR 2c0cc5c0 MISC 2140424286
[49955.665237] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100263 SOCKET 0 APIC 0
[49955.672706] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: cc00010000010093
[49955.680938] [Hardware Error]: TSC 0 ADDR 2c0cd8c0 MISC 2048121286
[49955.687249] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100263 SOCKET 0 APIC 0
[49955.694710] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
[49955.702949] [Hardware Error]: TSC 0 ADDR 2c0ca0c0 MISC 14074f486
[49955.709190] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100267 SOCKET 0 APIC 0
[49955.716653] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
[49955.724895] [Hardware Error]: TSC 0 ADDR 2c0c81c0 MISC 2148008086
[49955.731248] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100267 SOCKET 0 APIC 0
[49955.738709] [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: fe00004000010093
[49955.746953] [Hardware Error]: RIP !INEXACT! 60:<00000000c1013a4c> {mce_rdmsrl+0x6c/0x110}
[49955.755344] [Hardware Error]: TSC 917c4c8538b6 ADDR 2c0cabc0 MISC 20400e0e86
[49955.762640] [Hardware Error]: PROCESSOR 0:206d7 TIME 1333100269 SOCKET 0 APIC 0
[49955.770080] [Hardware Error]: Machine check: Processor context corrupt
[49955.776673] Kernel panic - not syncing: Fatal Machine check
[49955.782350] Pid: 21421, comm: hackbench Tainted: G M W 3.1.0 #1
[49955.788953] Call Trace:
[49955.791486] [<c1656f4a>] ? printk+0x18/0x1a
[49955.795832] [<c1656e3c>] panic+0x57/0x14d
[49955.800060] [<c1013c62>] mce_panic+0x172/0x1a0
[49955.804668] [<c1014d32>] do_machine_check+0x812/0x820
[49955.809876] [<c16591c1>] ? __slab_free+0x1d/0x214
[49955.814732] [<c1014520>] ? mce_process_work+0x10/0x10
[49955.819961] [<c16604aa>] error_code+0x5a/0x60
[49955.824486] [<c1014520>] ? mce_process_work+0x10/0x10
[49955.829708] [<c16591c1>] ? __slab_free+0x1d/0x214
[49955.834577] [<c108beb8>] ? rcu_irq_exit+0x8/0x10
[49955.839369] [<c1040917>] ? irq_exit+0x37/0x90
[49955.843888] [<c1015afd>] ? smp_threshold_interrupt+0x2d/0x30
[49955.849733] [<c16603aa>] ? threshold_interrupt+0x2a/0x30
[49955.855210] [<c10ded05>] kfree+0xf5/0x120
[49955.859370] [<c14fb850>] ? skb_release_data+0xa0/0xc0
[49955.864591] [<c1598e6a>] ? unix_destruct_scm+0x7a/0x80
[49955.869885] [<c14fb850>] ? skb_release_data+0xa0/0xc0
[49955.875104] [<c14fb850>] skb_release_data+0xa0/0xc0
[49955.880131] [<c14fb882>] __kfree_skb+0x12/0x90
[49955.884744] [<c14fb92a>] consume_skb+0x2a/0x80
[49955.889364] [<c159ad23>] unix_stream_recvmsg+0x1e3/0x580
[49955.894868] [<c1057180>] ? abort_exclusive_wait+0x80/0x80
[49955.900436] [<c14f4f54>] sock_aio_read+0x114/0x130
[49955.905379] [<c10e1844>] do_sync_read+0xa4/0xe0
[49955.910072] [<c10e1d77>] ? rw_verify_area+0x67/0x120
[49955.915239] [<c105b574>] ? hrtimer_interrupt+0x154/0x250
[49955.920713] [<c10e2199>] vfs_read+0x149/0x160
[49955.925264] [<c10e2428>] sys_read+0x38/0x70
[49955.929600] [<c165fa5d>] syscall_call+0x7/0xb
[49955.934141] [<c1650000>] ? early_init_intel+0x27/0x152
[49957.015689] Rebooting in 30 seconds..
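
For anyone reading the raw values, here is a minimal sketch of how the
"Bank 5" status words and the "Machine Check Exception: N" value in the log
can be split into their architectural fields. It assumes the IA32_MCi_STATUS
and IA32_MCG_STATUS bit layout from the Intel SDM; the function names are
only illustrative and are not taken from the kernel or mcelog. The corrected
error count field is only meaningful on CPUs that report support for it.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Flag bits of IA32_MCG_STATUS (the number after "Machine Check Exception:").
MCG_STATUS_FLAGS = {2: "MCIP", 1: "EIPV", 0: "RIPV"}


def set_flags(value, table):
    """Return the names of the set bits, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and code fields."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        # Bits 52:38, valid only if the CPU supports corrected error counting.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mscod": (status >> 16) & 0xFFFF,   # model-specific error code
        "mcacod": status & 0xFFFF,          # architectural MCA error code
    }


def decode_mcg_status(mcgstatus):
    """Return the flag names set in the global machine-check status."""
    return set_flags(mcgstatus, MCG_STATUS_FLAGS)


if __name__ == "__main__":
    for word in (0x8c00004000010093, 0xcc00010000010093, 0xfe00004000010093):
        print(hex(word), decode_mci_status(word))
    print("mcgstatus 5:", decode_mcg_status(5))
```

Decoded this way, the six records reported with exception value 0 all have
UC clear, while the last record (fe00004000010093) has UC and PCC set, which
matches the "Processor context corrupt" message. The value 5 in that last
record has EIPV clear, which is consistent with the "RIP !INEXACT!" marker.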

The processor is identified as:

# head -40 /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz
stepping : 7
microcode : 0x708
cpu MHz : 1200.000
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts
dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6384.73
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
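
As a cross-check, the "PROCESSOR 0:206d7" field in the MCE records should be
the vendor index followed by the CPUID signature of the same processor. The
following is a small sketch, assuming the standard CPUID leaf 1 EAX layout
from the Intel SDM; the function name is made up for illustration.

```python
def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID leaf 1 EAX value."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended model only applies to family 6 and family 15.
    if family in (6, 15):
        model |= ext_model << 4
    # The extended family only applies to family 15.
    if family == 15:
        family += ext_family
    return family, model, stepping


if __name__ == "__main__":
    print(decode_cpuid_signature(0x206d7))
```

For 0x206d7 this gives family 6, model 45, stepping 7, which agrees with the
/proc/cpuinfo output above.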

The BIOS on this platform is fairly recent:

Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F BIOS Date:02/10/2012 Rev:1.00

Currently, I suspect a hardware issue, as the machine is brand new.
I'll see whether v3.3 triggers the same MCE and possibly run a
memtest.

Any hints and/or suggestions are appreciated.

Thanks,
- Arnaud

[0]: other kernels might be affected, but I do not have explicit logs
of crashes on those.

Attachment: E5-1650-dmesg
Description: Binary data