Re: hda problems

Theodore Y. Ts'o (tytso@mit.edu)
Wed, 4 Sep 1996 10:29:13 -0400


Date: Wed, 04 Sep 1996 00:24:30 -0500
From: "Bryan C. Andregg" <bandregg@idir.net>

Sep 4 00:14:09 hueco kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 4 00:14:09 hueco kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=920040, sector=7732
Sep 4 00:14:09 hueco kernel: end_request: I/O error, dev 03:07, sector 7732

These are low-level errors from the disk driver, reporting errors from
the disk controller. They mean that your drive is developing hardware
errors. When this happens, *immediately* perform a backup!! If you have
a spare (empty) disk handy, do the backup by using dd to copy the raw
disk image.
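
A minimal sketch of such a raw-image backup follows. The device names
/dev/hda (the failing disk) and /dev/hdb (a spare of at least the same
size) are assumptions; check them carefully, since dd overwrites the
target without asking. conv=noerror,sync tells dd to continue past
unreadable blocks and pad them with zeros, so that everything after a
bad spot still lands at the right offset in the copy.

```sh
# image_disk SRC DST: copy SRC to DST block by block, continuing past
# read errors. Unreadable blocks come out as zeros in the copy.
image_disk() {
    src=$1
    dst=$2
    # A small block size limits how much data is zeroed around each bad
    # sector, at the cost of a slower copy.
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync
}

# Example (assumed device names; run as root, with neither disk mounted):
#   image_disk /dev/hda /dev/hdb
```

The block size is a trade-off rather than a recommendation: with
conv=noerror,sync a read error costs one whole block of data, so a
smaller bs loses less but takes longer.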

Unfortunately, due to the way IDE disk technology fails, when you start
seeing these sorts of errors, very often they are a prelude to massive
disk failure. The disk head may have done a "micro-crash" and skipped
across the platter at those places, or some dirt managed to get past the
filters/seals on the disk and has skipped across the platters, damaging
those disk sectors. Often, the debris left over from those events will
cascade through the platters, causing even more damage, which raises
more debris... If this is the case, you will see an exponentially
increasing number of disk block failures. It will be slow at first, or
even stop for a while; but eventually the debris will get dislodged and
cause more damage (especially true on laptop drives, in my experience).
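
One way to see whether the failures are accelerating is to count the
driver's error lines per day in the system log. This is only a sketch:
it assumes syslog-format lines like the ones quoted above, and the
function name errors_per_day and the log path in the example are
assumptions, not anything the kernel or syslogd provides.

```sh
# errors_per_day [DEV]: read syslog lines on stdin and print, for each
# day, how many "error=0x.." lines the IDE driver logged for DEV.
errors_per_day() {
    awk -v dev="${1:-hda}" '
        index($0, "kernel: " dev ": ") && /error=0x/ {
            day = $1 " " $2              # month and day of month
            if (!(day in count)) order[++n] = day
            count[day]++
        }
        END {
            # Print days in the order they first appeared in the log.
            for (i = 1; i <= n; i++) print order[i], count[order[i]]
        }
    '
}

# Example (assumed log location):
#   errors_per_day hda < /var/log/messages
```

A count that keeps climbing from one day to the next is the pattern
described above. Keep in mind that reading the log does not touch the
failing sectors, but anything that rereads them does.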

e2fsck hasn't found any errors yet because the disk block errors haven't
corrupted any filesystem metadata yet. So far, they have only corrupted
blocks containing the data in files, like the RPM database file in your
case.
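
The kernel messages quoted at the top already say where the damage is,
which helps when working out what lives on the bad blocks. The sketch
below decodes them. Only the flag names that appear in the quoted log
(DriveReady, SeekComplete, DataRequest, Error, UncorrectableError) are
confirmed by it; the other names in the tables, the reading of "sector"
as a 512-byte offset within the partition, and the 1024-byte filesystem
block size in the example are assumptions.

```sh
# decode_bits VALUE MASK:NAME...: print the names whose mask bit is set.
decode_bits() {
    val=$(( $1 ))
    shift
    out=""
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( val & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

# IDE status register (names not seen in the quoted log are assumed).
decode_status() {
    decode_bits "$1" 0x80:Busy 0x40:DriveReady 0x20:WriteFault \
        0x10:SeekComplete 0x08:DataRequest 0x04:CorrectedError \
        0x02:Index 0x01:Error
}

# IDE error register (names not seen in the quoted log are assumed).
decode_error() {
    decode_bits "$1" 0x80:BadSector 0x40:UncorrectableError \
        0x20:MediaChanged 0x10:SectorIdNotFound \
        0x08:MediaChangeRequested 0x04:DriveStatusError \
        0x02:TrackZeroNotFound 0x01:AddrMarkNotFound
}

# partition_start LBASECT SECTOR: absolute sector where the partition
# begins, assuming SECTOR is relative to the partition.
partition_start() { echo $(( $1 - $2 )); }

# fs_block SECTOR BLOCKSIZE: filesystem block holding a 512-byte sector.
fs_block() { echo $(( $1 * 512 / $2 )); }

decode_status 0x59             # DriveReady SeekComplete DataRequest Error
decode_error 0x40              # UncorrectableError
partition_start 920040 7732    # 912308
fs_block 7732 1024             # 3866
```

"dev 03:07" is major 3, minor 7, which on a standard device layout is
/dev/hda7.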

If you use the "badblocks" program, it will undoubtedly show you the bad
blocks, and you can feed them to e2fsck in an attempt to map out those
bad blocks, to prevent the kernel from trying to use them. However, do a
disk backup first!! There is a fairly good chance that this is an early
warning of massive and complete disk failure. Your disk may only have a
limited number of "reads" left on it, and you shouldn't waste them on
the badblocks program. Save your data first, and only then start
worrying about using programs like badblocks.
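
Once the backup is safely done, the badblocks-then-e2fsck step might
look like the sketch below. The function name map_bad_blocks, the
partition /dev/hda7, and the list file path are assumptions for
illustration; the filesystem should be unmounted while this runs.

```sh
# map_bad_blocks DEV LIST: scan DEV read-only, write the bad block
# numbers to LIST, then have e2fsck add them to the bad-block list.
map_bad_blocks() {
    dev=$1
    list=$2
    # badblocks must count in the filesystem's block size, or the
    # numbers handed to e2fsck will point at the wrong blocks.
    bs=$(dumpe2fs -h "$dev" 2>/dev/null |
         awk -F: '/^Block size/ { gsub(/[ \t]/, "", $2); print $2 }')
    if [ -z "$bs" ]; then
        echo "cannot read block size of $dev" >&2
        return 1
    fi
    badblocks -b "$bs" -o "$list" "$dev" || return 1
    e2fsck -f -l "$list" "$dev"
}

# Example (assumed names; run as root on an unmounted filesystem):
#   map_bad_blocks /dev/hda7 /tmp/hda7.badblocks
```

Running e2fsck with its -c option has it call badblocks itself, which
avoids the block-size bookkeeping; the two-step form is shown here only
because it mirrors the description above.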

In fact, I will often treat the appearance of these failures (especially
on a disk that has a few years' worth of life on it) as a "Timmy,
Lassie's trying to tell us something" moment and take it as a hint that
it's time to replace the hard disk. IDE disks are relatively cheap these
days, and by the time a disk has had 3-5 years of hard life, it's about
exhausted its potential lifetime anyway. It's no accident that the
lifetime on Conner's drives was only recently extended to three
years...

- Ted

P.S. Besides, the capacity/dollar of drives has been doubling every
12-18 months --- isn't it time you rewarded yourself with a new drive? :-)

P.P.S. If someone maintaining the Linux FAQ or HOWTOs would like to
include this text in one of the HOWTOs, feel free. You have my
permission to reproduce this as you see fit.