RE: "exception Emask: 0x42" errors with 2.6.22.x and SATA drives

From: Dermot Bradley
Date: Mon Aug 27 2007 - 05:28:30 EST


> FWIW, I've got the HDMI version of this board and I have exactly the
> same problem (even with the newest BIOS) if nmi_watchdog is not set to
> zero. Try booting with nmi_watchdog=0 (default on x86-64, I think) and
> see if these go away.
>
> I guess the APIC has some difficulties handling NMIs.

On Friday I had rebooted the box (remotely) after adding "noapic" to the
boot options and the machine didn't come back up - came in this morning
to see a kernel panic after the reboot to do with nmi_watchdog.

This morning I've removed the "noapic" option and instead set
"nmi_watchdog=0", and it's booted and been up and running for 30 minutes
with no APIC errors, whereas these happened every time previously during
boot.

So nmi_watchdog does indeed seem to be related to the root cause of the
problem.

Thanks for the help, Alistair! One other point you may be able to help
with - this is the first time I've used a dual core processor and I
expected that /proc/interrupts would show interrupts distributed
between both cores, whereas they actually seem to be mainly handled by
the 1st core:

           CPU0       CPU1
  0:        251          0   IO-APIC-edge      timer
  1:       2208         11   IO-APIC-edge      i8042
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 16:       5291          3   IO-APIC-fasteoi   ehci_hcd:usb1, eth0
 17:     223026         13   IO-APIC-fasteoi   ahci
 18:          0          1   IO-APIC-fasteoi   ohci_hcd:usb2
 19:          0        126   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb5
 20:          0          2   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb6
 21:     188591          1   IO-APIC-fasteoi   HFC-multi
NMI:          0          0
LOC:    3036393    1527288
ERR:          0
MIS:          0

Is this to be expected for dual core systems?
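
To put a number on that split, here is a minimal sketch that sums the
numbered IRQ rows per CPU and reports each core's share. It assumes
Python is available on the box and that the file has the usual layout
(a header row of CPU names, then "IRQ: count count ..." rows);
per_cpu_share is a hypothetical helper name, not an existing tool.

```python
def per_cpu_share(text):
    """Return {cpu_name: fraction of numbered-IRQ interrupts} for
    text in the layout of /proc/interrupts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()                  # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():      # skip NMI/LOC/ERR/MIS rows
            continue
        counts = rest.split()[:len(cpus)]    # one counter per CPU
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero
    return {cpu: n / grand_total for cpu, n in totals.items()}
```

Feeding it the contents of /proc/interrupts (for example via
open("/proc/interrupts").read()) gives one fraction per CPU; for the
figures above, CPU0 comes out at over 99% of the device interrupts.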

> I get the feeling this problem is independent of the APIC errors, and
> I don't see it here

Yeah, I thought as much. After disabling the NMI watchdog I didn't see
these NCQ messages initially, but after a few bonnie runs to load the
disks I'm now getting them again.

> As Alan said, it's very possibly just the drive not properly
> supporting NCQ.

Yeah, that seems to be the case - last Friday these errors seemed to be
occurring every 1-2 hours or so, even when I wasn't using the machine.

I'm currently rebuilding the kernel with the following patch to disable
NCQ for these drives:

--- linux-2.6.22.5/drivers/ata/libata-core.c	2007-08-23 00:23:54.000000000 +0100
+++ linux-2.6.22.5.new/drivers/ata/libata-core.c	2007-08-27 10:25:16.000000000 +0100
@@ -3788,6 +3788,7 @@
 	/* NCQ is broken */
 	{ "Maxtor 6L250S0", "BANC1G10", ATA_HORKAGE_NONCQ },
 	{ "Maxtor 6B200M0", "BANC1B10", ATA_HORKAGE_NONCQ },
+	{ "WDC WD800BEVS-07", NULL, ATA_HORKAGE_NONCQ },
 	/* NCQ hard hangs device under heavier load, needs hard power cycle */
 	{ "Maxtor 6B250S0", "BANC1B70", ATA_HORKAGE_NONCQ },
 	/* Blacklist entries taken from Silicon Image 3124/3132
