Re: Machine check exception with a kernel dependency

From: Frank van Maarseveen
Date: Fri Feb 15 2008 - 09:50:25 EST


On Fri, Feb 15, 2008 at 01:22:41PM +0000, Alan Cox wrote:
> On Wed, 13 Feb 2008 17:25:28 +0100
> Frank van Maarseveen <frankvm@xxxxxxxxxxx> wrote:
>
> > On at least two Dell optiplex 755 systems with a Core 2 Duo I get
> >
> > Feb 13 15:14:01 inari CPU 1: Machine Check Exception: 0000000000000004
> > Feb 13 15:14:01 inari CPU 0: Machine Check Exception: 0000000000000005
> > Feb 13 15:14:01 inari Bank 0: b200004000000800
> > Feb 13 15:14:01 inari Bank 5: b200221024080400
> >
> > 2.6.22.10 shows the problem, 2.6.24.2 ditto but I'm unable to reproduce
> > it with 2.6.24-rc8. BIOS upgrade didn't help. Removing all PCI[e] cards
> > didn't help either.
>
> If you run the MCE numbers through a decoder what do you get back ?

I'm having some trouble decoding these convincingly. mcelog --core2
--ascii reports "MCG status:RIPV MCIP" for 0000000000000005 and "MCG
status:MCIP" for 0000000000000004.
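
For anyone who wants to cross-check the MCG status part by hand, here is a minimal sketch in Python. It assumes the architectural IA32_MCG_STATUS layout (bit 0 RIPV, bit 1 EIPV, bit 2 MCIP); the helper name decode_mcg_status is made up for illustration and is not part of mcelog.

```python
# Architectural flag bits of IA32_MCG_STATUS (assumed layout:
# bit 0 = RIPV, bit 1 = EIPV, bit 2 = MCIP).
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_mcg_status(value):
    """Return the names of the flags set in an MCG status value, lowest bit first."""
    return [name for bit, name in sorted(MCG_STATUS_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # The two values from the log above.
    for raw in ("0000000000000005", "0000000000000004"):
        print(raw, " ".join(decode_mcg_status(int(raw, 16))))
```

Under that assumption, 5 has RIPV and MCIP set and 4 has only MCIP set, which agrees with what mcelog printed.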

I've collected several Bank # output lines:

count  text
-----  ------------------------
   26  Bank 0: b200004000000800
   10  Bank 5: b200121014040400
    8  Bank 5: b200121020080400
    4  Bank 5: b200221010040400
    4  Bank 5: b200221024080400

but mcelog expects lines of the format

CPU %u: Machine Check Exception: %16Lx Bank %d: %016Lx

(netconsole split each record across separate lines), so I assembled
these by hand:

CPU 1: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121014040400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121020080400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221010040400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221024080400
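
The same joining can be scripted. A rough sketch in Python, assuming the CPU number, MCG status, bank number and bank status for each record have already been paired up by hand as above; format_mce_line is a hypothetical helper, and the zero-padded 16-digit hex is an assumption that mirrors the lines in this mail.

```python
def format_mce_line(cpu, mcg_status, bank, bank_status):
    """Build one single-line record in the layout mcelog --ascii expects.

    Both status values are written as 16 zero-padded hex digits, which is an
    assumption based on how they appear in the log lines above.
    """
    return ("CPU %d: Machine Check Exception: %016x Bank %d: %016x"
            % (cpu, mcg_status, bank, bank_status))


# (cpu, mcg_status, bank, bank_status), paired by hand from the log.
RECORDS = [
    (1, 0x4, 0, 0xb200004000000800),
    (0, 0x5, 5, 0xb200121014040400),
    (0, 0x5, 5, 0xb200121020080400),
    (0, 0x5, 5, 0xb200221010040400),
    (0, 0x5, 5, 0xb200221024080400),
]

if __name__ == "__main__":
    for record in RECORDS:
        print(format_mce_line(*record))
```

The pairing itself is still a manual judgment call, since the netconsole output does not say which Bank line belongs to which CPU line.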

result:

CPU 1: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Originated-request Generic Memory-access Request-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout)
STATUS b200004000000800 MCGSTATUS 4

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121014040400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121014040400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121020080400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121020080400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221010040400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221010040400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221024080400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
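
As a sanity check on the decoder output, the status words can also be split up by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific error code, bits 15:0 MCA error code). The function name decode_mci_status is hypothetical, and the sketch does not attempt to interpret the model-specific bits.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout), high to low.
MCI_STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def decode_mci_status(value):
    """Split a bank status word into its flags and the two 16-bit error codes."""
    return {
        "flags": [name for bit, name in MCI_STATUS_FLAGS if value & (1 << bit)],
        "model_code": (value >> 16) & 0xFFFF,  # bits 31:16, model specific
        "mca_code": value & 0xFFFF,            # bits 15:0, architectural
    }


if __name__ == "__main__":
    for raw in ("b200004000000800", "b200121014040400", "b200121020080400",
                "b200221010040400", "b200221024080400"):
        decoded = decode_mci_status(int(raw, 16))
        print(raw, " ".join(decoded["flags"]),
              "model=%04x mca=%04x" % (decoded["model_code"], decoded["mca_code"]))
```

With that layout, all five values carry VAL, UC, EN and PCC, matching the "Uncorrected error / Error enabled / Processor context corrupt" lines. The Bank 0 value has MCA code 0800 and every Bank 5 value has MCA code 0400, so the Bank 5 variants differ only in the model-specific and other-information fields.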


The problem also exists on an entirely different Xeon system with 4 cores:

cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU X3210 @ 2.13GHz
stepping : 11


--
Frank