Update on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

From: Jonathan Kamens
Date: Wed Jan 14 2004 - 21:22:28 EST


Hello everyone,

I'd like to provide an update on my efforts to understand what causes
"DriveStatusError BadCRC" errors when using UDMA drives, how to debug
these errors in general, the specific progress I've made at resolving
these errors on my system, and subsequent problems I've encountered
when doing so.

Recall that I was getting these errors from 2.4.22-ac4 on my
dual-processor (550MHz Pentium III Katmai) system:

hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

from a Seagate 160GB drive (ST3160021A) plugged into its own channel
on a Promise Ultra66 (PDC20262) controller.
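
For anyone reading these lines for the first time: the hex values are
bitmasks over the ATA status and error registers, and the words in
braces are the names of the bits that are set. The sketch below shows
the decoding. The bit names and the BadCRC/BadSector rule are my
understanding of what the 2.4 IDE driver prints, so treat the tables
as an assumption rather than a copy of the driver source; the function
names are made up for illustration.

```python
# Status register bits, highest bit first (names assumed to match
# what the 2.4 IDE driver prints).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


def decode_error(error):
    """Return the names of the bits set in an ATA error byte.

    Bit 0x80 is ambiguous: it is reported as BadCRC when the abort
    bit (0x04) is also set, and as BadSector otherwise (assumed
    driver behaviour).
    """
    names = []
    aborted = bool(error & 0x04)
    if aborted:
        names.append("DriveStatusError")
    if error & 0x80:
        names.append("BadCRC" if aborted else "BadSector")
    if error & 0x40:
        names.append("UncorrectableError")
    if error & 0x10:
        names.append("SectorIdNotFound")
    if error & 0x02:
        names.append("TrackZeroNotFound")
    if error & 0x01:
        names.append("AddrMarkNotFound")
    return names


if __name__ == "__main__":
    print("status=0x51 {", " ".join(decode_status(0x51)), "}")
    print("error=0x84 {", " ".join(decode_error(0x84)), "}")
```

With the values from my log, 0x51 is 0x40 + 0x10 + 0x01 and 0x84 is
0x80 + 0x04, which is where the names in the braces come from.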

The suggestion most frequently given to me and others for resolving
BadCRC errors is to replace the IDE cable with one that conforms to
the Ultra ATA spec (80-conductor, flat, two drive connectors, single
drive connected to the end connector). I tried several such cables,
none of which made the BadCRC errors go away.

Other suggestions given to me included:

* Make sure the IDE cable is not running parallel to another cable.
* Make sure the cable is not passing near magnets inside the case,
e.g., speaker magnets.
* Update the IDE controller's firmware.
* Check to make sure the PCI bus speed is valid (33MHz, normally).
* Make sure the PCI latency timer is set in the BIOS to at least 64.

I tried all of these suggestions, and none of them worked.

I tried swapping the drives on the controller's two channels, and the
BadCRC errors traveled with the drive. Then I swapped the cables on
the two channels, and the errors still remained on the same drive.
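
A swap test like this is easier to read if you tally the BadCRC lines
per device name before and after each swap. Here is a minimal sketch
that does so for kernel log lines in the format shown earlier. The
regular expression and the function name count_badcrc are my own
illustration, and it assumes the log keeps the "hdX: dma_intr:
error=0x.." layout; the device names in the usage example are only
examples.

```python
import re
from collections import Counter

# Matches lines such as:
#   hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
# (assumed log layout; adjust if your kernel formats it differently)
ERROR_LINE = re.compile(r"\b(hd[a-z]): dma_intr: error=0x[0-9a-fA-F]+ \{([^}]*)\}")


def count_badcrc(lines):
    """Count BadCRC error lines per IDE device name."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match and "BadCRC" in match.group(2).split():
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
        "hde: dma_intr: error=0x84 { DriveStatusError BadCRC }",
        "hdg: dma_intr: error=0x84 { DriveStatusError BadCRC }",
        "hde: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    ]
    for device, count in sorted(count_badcrc(sample).items()):
        print(device, count)
```

Keep in mind that a drive's device name changes when it moves to
another channel, so compare the counts against which physical drive
held each name at the time.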

Next, I bought a SIIG Ultra ATA 133 controller, compiled SIIMAGE
support into my kernel, plugged in the new SIIG controller, and moved
the drive getting the BadCRC errors over to it. They stopped -- I
haven't seen a single BadCRC error since I moved the drive to the SIIG
controller a couple of weeks ago.

Alas, another problem has presented itself. Twice after I installed
the SIIG controller and moved the Seagate drive to it, my system hung
(all activity seemed to stop, syslogd stopped logging, X server
stopped responding, couldn't switch VTs). Both times, Alt-SysRq-s and
Alt-SysRq-u appeared to have no effect, but Alt-SysRq-b successfully
rebooted the system. I couldn't get any more information because I
don't have a serial console and my monitor was in X when the hang
happened; since I couldn't switch VTs, I couldn't get to one where the
magic SysRq sequences would display information.

After the second hang, I tried two more things. First, I moved the
other drive to the SIIG controller, so that the Promise controller no
longer has any drives on it. (It's still plugged in, and my
motherboard's PIIX4 controller still has a hard drive, a CD-ROM and an
OnStream DI-30 drive plugged into it as hda, hdc and hdd,
respectively.) Second, I turned off IRQ unmasking for the drives on
the SIIG controller, as suggested in other messages here.
Unfortunately, even with these two additional steps, I'm still seeing
kernel hangs, albeit seemingly less frequently -- I just had another
one about an hour ago.

I've just enabled the NMI watchdog, compiled software watchdog support
into my kernel and installed and enabled the watchdog daemon. If
anyone can suggest anything else I can do to debug these hangs, I'm
all ears.

Thanks for reading this far. :-)

Jonathan Kamens