2.4.22-rc2 ext2 filesystem corruption

From: Martin Maney
Date: Mon Aug 11 2003 - 23:08:52 EST



Okay, further testing is clearly indicated (and I'm recompiling a test
kernel while writing this to try to narrow it down a little), but I've
got a very repeatable file corruption under 2.4.22-rc2 that does not
manifest under 2.4.21. My repeatable test case only (so far?) causes
the data in the file to be corrupted, but I suspect metadata can get
hit as well, and I have seen some filesystem errors that were probably
caused by this, but not so that I can say so with certainty.

The recipe is simple: cp a large file across filesystems. All looks
well (md5sums match, etc.), but at that point the file is still entirely
in memory. If you then unmount the destination filesystem to invalidate
the buffers, the file data will have changed after remounting. I'm pretty
certain that I have observed the same effect without the mass
invalidation of umount in a couple of cases, but I haven't replicated
that.
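
For anyone who wants to try this, the recipe above can be sketched
roughly as follows. This is only an illustration, not the exact commands
I ran: the function names are made up, and remount() assumes root
privileges and an /etc/fstab entry for the destination mount point.

```python
import hashlib
import shutil
import subprocess


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_hash(src, dst):
    """Copy src to dst and return (source digest, destination digest)."""
    shutil.copyfile(src, dst)
    return md5sum(src), md5sum(dst)


def remount(mount_point):
    """Unmount and mount again so cached data for that filesystem is dropped.

    Assumption: run as root, with mount_point listed in /etc/fstab.
    """
    subprocess.run(["umount", mount_point], check=True)
    subprocess.run(["mount", mount_point], check=True)


def reproduce(src, dst, mount_point):
    """Return (matches right after the copy, matches after the remount)."""
    src_sum, dst_sum = copy_and_hash(src, dst)
    remount(mount_point)
    return src_sum == dst_sum, src_sum == md5sum(dst)
```

On an affected kernel I would expect reproduce() to return (True, False):
the copy checks out while it is still cached, and differs once it has to
be read back from disk.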

In all cases I have investigated, the corruption seems to take the form
of four bytes of garbage at the beginning of a block; two small scripts
have been observed with 4 NULLs prepended and the last four characters
truncated. In another case I found a block of over 100 bytes (I got
tired of wading through it after a while) in the same form - four bytes
were inserted into the corrupted file, pushing the data back. In
hindsight I wish I had investigated that case further; as it is, I'm
not positive the dislocation was at a disk block boundary.

(I have one example I saved that appears NOT to begin at a block
boundary, with a dislocation that continues for at least 8KB (by spot
checking of cmp --verbose output).)
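
A less tedious way to answer the block-boundary question than reading
cmp --verbose output might be something like the sketch below. It finds
the first differing byte, reports whether that offset is block-aligned,
and checks whether the bad copy looks like the good data pushed back by
four bytes. BLOCK_SIZE is an assumption (ext2 can also use 1024 or 2048),
and the helper names are invented for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size; adjust to match the filesystem


def first_mismatch(good, bad):
    """Return the offset of the first differing byte, or None if identical."""
    limit = min(len(good), len(bad))
    for i in range(limit):
        if good[i] != bad[i]:
            return i
    # Same common prefix: they differ only if one is longer than the other.
    return None if len(good) == len(bad) else limit


def looks_shifted(good, bad, offset, shift=4, window=64):
    """True if bad holds good's bytes displaced by shift from offset on."""
    expected = good[offset:offset + window]
    actual = bad[offset + shift:offset + shift + window]
    return len(expected) > 0 and actual == expected


def describe(good, bad):
    """Summarize the first difference between two byte strings."""
    offset = first_mismatch(good, bad)
    if offset is None:
        return None
    return {
        "offset": offset,
        "block_aligned": offset % BLOCK_SIZE == 0,
        "shifted_by_4": looks_shifted(good, bad, offset),
    }
```

Usage would be describe(open(original, "rb").read(), open(copy, "rb").read()).
One caveat: if the first inserted byte happens to equal the original byte
at that position, the reported offset lands a byte or more late, so a
near-miss on alignment deserves a second look.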

The machine is a PIII/850 on an Asus 440BX board with a Promise 20265
controller; the Seagate ST340016A is the only device connected to the
Promise's ports. There's 640MB of ECC'd memory on board, and I haven't
had an SBE reported on this system in a year or so (the last hardware
change was two or three months ago). (I disabled the ECC monitoring
module while verifying this problem; made no difference.)

The "large file" I've been using (because it was where I first observed
an issue) was the XFree86 4.2.1 source archive. At 54MB, it is less
than 1/10th the size of physical RAM.

--
There is nothing perhaps so generally consoling to a man as a
well-established grievance; a feeling of having been injured,
on which his mind can brood from hour to hour, allowing him
to plead his own cause in his own court, within his own heart,
and always to plead it successfully. -- Anthony Trollope
