Re: PROBLEM: Filesystem corruption under 2.2.12

Theodore Y. Ts'o (tytso@mit.edu)
Wed, 1 Sep 1999 09:14:02 -0400


Date: Wed, 01 Sep 1999 02:01:37 +0300
From: Touko Korpela <tkorpela@gnwmail.com>

[2.] Full description of the problem/report:

Under Linux 2.2.12 (and 2.2.11) one directory gets
mysteriously corrupted always when some cron-job
scans it. This didn't happen under 2.0.x. I first
suspected HD's DMA-mode but disabling it didn't help.
The message is:
EXT2-fs error (device ide0(3,5)): ext2_readdir: bad
entry in directory #18682: rec_len is smaller than
minimal - offset=0, inode=393222, rec_len=7, name_len=6

What happens if you unmount the filesystem and run e2fsck -Ff /dev/hda5
twice, back to back? Do you see the error again after the second time
you run e2fsck?

Looking at the directory dumps, it looks like starting at offset 014000
of your directory, a block was replaced by garbage. This could either
be because garbage is getting written into the block somehow (due to
a kernel bug of some kind), or it could just be that you have a bad
block at that location, which isn't accepting writes. (The -F flag
flushes the buffer cache before running e2fsck, to avoid the case where
the directory has been "corrected" in memory, but for some reason the
disk isn't accepting the correction.)

Another thing you might want to try is using the debugfs "ncheck"
command to determine the name of the directory which is repeatedly
getting corrupted (I assume it's the same directory each time?). Also,
it would be useful to try doing a dump of the directory after multiple
cases of corruption to see if the corruption is happening in the same
place each time.

Also, is /dev/hda5 your Linux boot partition by any chance, and do you
use LILO to boot? If so, double check to make sure the physical and
logical partition boundaries match correctly. I've seen one case where
they didn't match, and since LILO uses the c/h/s partition entries to
determine where to update the LILO map sector, and LILO writes to the
map sector during the boot process to clear out the one-time
kernel-to-boot field set by "lilo -R", you can get a very hard to debug
situation where LILO was reliably writing to the wrong block, and thus
trashing whatever data happened to be there. This can be a very hard to
debug situation! I'm not saying that's what happened in your case, but
it's something that might be worth checking.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/