Re: PROBLEM: Fatal Machine Check >= 3.13.5-101.fc19.x86_64

From: Borislav Petkov
Date: Fri Mar 21 2014 - 16:14:15 EST


+ Tony.

Provided the decode is correct and I'm reading it right, this looks
like the cores end up in a livelock for some reason and make no forward
progress. The MCEs signal that no instruction has been retired for a
relatively long time, i.e. a stall.
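
For reference, the flags mcelog prints can be read straight off the raw
STATUS and MCGSTATUS values quoted below. A minimal Python sketch,
assuming the architectural bit layout described in the Intel SDM; the
function and table names are made up for illustration, and it only
classifies the two MCA error codes that show up in this report:

```python
# Sketch only: decode the architectural bits of IA32_MCi_STATUS and
# IA32_MCG_STATUS. Names below are hypothetical, not from mcelog.

MCI_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC is valid
    (58, "ADDRV"),  # MCi_ADDR is valid
    (57, "PCC"),    # processor context corrupt
]

MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def flags(value, table):
    """Return the names of all flag bits from `table` set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mca_code(code):
    """Classify the low 16 bits of MCi_STATUS (two forms only)."""
    if code == 0x0400:
        return "internal timer error"
    if (code & 0xF800) == 0x0800:  # 0000 1PPT RRRR IILL: bus/interconnect
        return {
            "class": "bus/interconnect",
            "participation": (code >> 9) & 0x3,  # PP, 0 = local CPU request
            "timeout": bool((code >> 8) & 1),    # T, False = did not time out
            "request": (code >> 4) & 0xF,        # RRRR, 0 = generic
            "ii": (code >> 2) & 0x3,             # II, 0 = memory access
            "level": code & 0x3,                 # LL, 0 = level 0
        }
    return "other (0x%04x)" % code


if __name__ == "__main__":
    status = 0xb200004000000800  # bank 0 value from the report
    print(flags(status, MCI_FLAGS), decode_mca_code(status & 0xFFFF))
    print(flags(5, MCG_FLAGS))
```

For bank 0, 0xb200004000000800 gives VAL, UC, EN and PCC, and MCGSTATUS 5
is RIPV plus MCIP, which agrees with the mcelog text.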

You say this happens when GNOME starts. Does it also happen if you
don't start GNOME, i.e. don't start X at all? Try booting into a
runlevel without graphics.

Tony, any other ideas?

Also, can you send the full dmesg of both a working boot without the
MCEs and one with them?
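
If it helps when comparing the two logs, the timestamps can be stripped
first so that only real differences show up. A rough Python sketch using
the standard difflib module; the helper names and the file names in the
usage note are placeholders, not anything from this thread:

```python
import difflib
import re

# Leading "[   34.348483] " style timestamp printed by the kernel.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(lines):
    """Drop the timestamp prefix and trailing newline from each line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def dmesg_diff(good_lines, bad_lines):
    """Return a unified diff of two dmesg captures, ignoring timestamps."""
    return list(difflib.unified_diff(
        strip_timestamps(good_lines),
        strip_timestamps(bad_lines),
        fromfile="dmesg-good", tofile="dmesg-bad", lineterm=""))
```

Usage would be along the lines of reading both captures with
open("dmesg-good.txt").readlines() and open("dmesg-bad.txt").readlines(),
then printing each line returned by dmesg_diff().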

Leaving in the rest.

On Fri, Mar 21, 2014 at 08:49:51PM +0100, Matthias Graf wrote:
> (Please CC me on all replies)
>
> mcelog output for all mces:
>
>
>
> Hardware event. This is not a software error.
> CPU 3 BANK 0
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
> Request-did-not-timeout Error
> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
> timeout BINIT (ROB timeout). No micro-instruction retired for some time
> STATUS b200004000000800 MCGSTATUS 5
>
>
> Hardware event. This is not a software error.
> CPU 3 BANK 5
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200220024080400 MCGSTATUS 5
>
>
> Hardware event. This is not a software error.
> CPU 1 BANK 0
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
> Request-did-not-timeout Error
> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
> timeout BINIT (ROB timeout). No micro-instruction retired for some time
> STATUS b200004000000800 MCGSTATUS 4
>
>
> Hardware event. This is not a software error.
> CPU 1 BANK 5
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200220010040400 MCGSTATUS 4
>
>
> Hardware event. This is not a software error.
> CPU 2 BANK 0
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
> Request-did-not-timeout Error
> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
> timeout BINIT (ROB timeout). No micro-instruction retired for some time
> STATUS b200004000000800 MCGSTATUS 4
>
>
> Hardware event. This is not a software error.
> CPU 2 BANK 5
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200221010040400 MCGSTATUS 4
>
> Hardware event. This is not a software error.
> CPU 0 BANK 5
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200221024080400 MCGSTATUS 5
>
>
> Hardware event. This is not a software error.
> CPU 0 BANK 0
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
> Request-did-not-timeout Error
> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
> timeout BINIT (ROB timeout). No micro-instruction retired for some time
> STATUS b200004000000800 MCGSTATUS 5
>
>
>
> On 21.03.2014 18:27, Borislav Petkov wrote:
> > On Fri, Mar 21, 2014 at 06:10:23PM +0100, Matthias Graf wrote:
> >> Please CC me on replies.
> >>
> >> [1.] Kernel panic: Fatal Machine Check after booting >=
> >> 3.13.5-101.fc19.x86_64; 3.12.11-201.fc19.x86_64 works fine!
> >> [2.] Screen freezes a few seconds after Gnome appears. The error message
> >> (see attachment) is seldom still printed to the screen. Booting
> >> 3.12.11-201 with otherwise the same setup, I do not see the panic.
> >> Booting on different hardware (my laptop) does not produce the panic. I
> >> also notice low frames per second after Gnome has started up, right before
> >> the panic occurs. I therefore suppose this is graphics hardware related.
> >> [3.] Fatal Machine Check Exception, RIP Inexact, apic_timer_interrupt,
> >> Kernel panic
> >> [4.] 3.13.6-100.fc19.x86_64 && 3.13.5-103.fc19.x86 && 3.13.5-101.fc19.x86_64
> >> [5.] OCRed: (see attachment for photo)
> >>
> >> Started Accounts Service.
> >> [ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 8: bZ88884888888888
> >> [ 44.468168] mce: [Hardware Error]: HIP ?IHEXïCT? 18:<ffffffff816881f8> {apicgtimer_interrupt+8x8/8x88}
> >> I 44.468168] mce: [Hardware Error]: TSC 36S??8ad8c
> >> f 44.468168] mce: [Hardware Error]: PROCESSOR 8:6fb TIM 138471666? SOCKET 8 HPIC 2 microcode ba
> >> I 44.468168] mce: [Hardware Error]: Run the above through 'mcelog ~~asciiâ
> >
> > This looks like you had some text recognition done on the jpeg. :-)
> >
> > Please correct the error message to be exactly as in the jpeg and run it
> > through mcelog --ascii to see what that bank 8 is trying to tell us.
> >
> > Thanks.
> >

> [ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
> [ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: b200220024080400
> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
> [ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200004000000800
> [ 44.468168] mce: [Hardware Error]: TSC 365779ad42
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200220010040400
> [ 44.468168] mce: [Hardware Error]: TSC 365779ad42
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: b200004000000800
> [ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 5: b200221010040400
> [ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: b200221024080400
> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
> [ 44.468168] mce: [Hardware Error]: TSC 365779aece
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 0: b200004000000800
> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
> [ 44.468168] mce: [Hardware Error]: TSC 365779aece
> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 44.468168] mce: [Hardware Error]: Machine check: Processor context corrupt
> [ 44.468168] Kernel panic - not syncing: Fatal Machine check
> [ 44.468168] drm_kms_helper: panic occurred, switching back to text console
> [ 44.468168] Rebooting in 30 seconds..


--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.