Filesystem corruption on RH2.2.5 over a read-only NFS export

Brian Strand (brians@switchmanagement.com)
Sat, 16 Oct 1999 19:44:33 -0700


Yesterday I experienced filesystem corruption running an unmodified
Red Hat 2.2.5 kernel (2.2.5-15 for i386) and the accompanying
knfsd-1.2.2-4. Luckily the corruption was in two source files rather
than a binary, so I recovered the originals from CVS. My setup is a
Linux 2.2.5 box NFS-serving a directory read-only to a SCO 3.2v5.0.5 box
and a Solaris 2.7 (UltraSparc) box.

I believe the corruption occurred while I was copying a small directory
from the NFS-exported filesystem to the SCO box's local filesystem (via
cp -r ... at the SCO box's console), but I am not sure. This is the
only thing I can remember doing differently. I have been
coding on the Linux box and compiling (VPATH builds to local
directories) on the other two boxen for weeks without any problems.

The files suffered slightly different forms of corruption. File one,
child.cpp, had its second block duplicated as its third block, and had
its second block show up as the third block of install_cdrserver (the
other corrupt file). Also, the tenth block of child.cpp simply
disappeared. File two, install_cdrserver, had its contents completely
replaced as follows:

chunk 1, bytes 0-715: first 716 bytes of yet another file
(install_functions), which was not corrupted.

chunk 2, bytes 716-1023: all nulls.

chunk 3, bytes 1024-1355: what looks like the CVS/Entries file for the
directory.

chunk 4, bytes 1354-2047: all nulls.

chunk 5, bytes 2048-3071: the second block of child.cpp.

chunk 6, bytes 3072-3115: a piece of the CVS/Entries file for another
directory.

chunk 7, bytes 3116-3261: all nulls.

The pre-corruption file was 3262 bytes long, which matches the end of
chunk 7 (byte 3261).
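
For anyone who wants to double-check the layout, here is a minimal
sketch in Python that tests whether the byte ranges listed above tile
the file without gaps or overlaps, and maps each range to 1 KiB blocks.
The 1 KiB block size is an assumption inferred from chunk 5 (bytes
2048-3071 described as a single block); the function names are
hypothetical. With the ranges exactly as typed, it reports a two-byte
overlap between chunks 3 and 4 (bytes 1354-1355), so one of those two
boundaries is presumably a typo, while the total still comes to 3262.

```python
# Inclusive byte ranges of the seven chunks, exactly as listed above.
CHUNKS = [
    (1, 0, 715),
    (2, 716, 1023),
    (3, 1024, 1355),
    (4, 1354, 2047),
    (5, 2048, 3071),
    (6, 3072, 3115),
    (7, 3116, 3261),
]
FILE_LENGTH = 3262  # pre-corruption length reported above
BLOCK_SIZE = 1024   # assumption: 1 KiB ext2 blocks, inferred from chunk 5


def check_layout(chunks, file_length):
    """Return a list of problems; empty if the ranges tile the file exactly."""
    problems = []
    expected_start = 0
    for number, start, end in chunks:
        if start < expected_start:
            problems.append(
                f"chunk {number} overlaps the previous chunk by "
                f"{expected_start - start} byte(s)"
            )
        elif start > expected_start:
            problems.append(
                f"gap of {start - expected_start} byte(s) before chunk {number}"
            )
        expected_start = end + 1
    if expected_start != file_length:
        problems.append(
            f"chunks end at byte {expected_start - 1}, "
            f"but the file length is {file_length}"
        )
    return problems


def blocks_touched(start, end, block_size=BLOCK_SIZE):
    """Return the 0-based block indices covered by an inclusive byte range."""
    return list(range(start // block_size, end // block_size + 1))


if __name__ == "__main__":
    for problem in check_layout(CHUNKS, FILE_LENGTH):
        print(problem)
    for number, start, end in CHUNKS:
        print(f"chunk {number}: blocks {blocks_touched(start, end)}")
```

Block indices here are 0-based, so chunk 5 landing in block index 2 is
consistent with it being the third block of install_cdrserver.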

I received about 1300 warnings in the system log along the lines of

ext2_free_blocks: Freeing blocks not in datazone - block = 1025538412,
count = 1

with wildly varying block numbers (but count unchanging), interspersed
with the occasional

ext2_free_blocks: bit already cleared for block 1192428
ext2_free_inode: bit already cleared for inode 297275
lookup_by_inode: ino 297275 not found in src
find_fh_dentry: 03:05/297275 dir/297242 not found!
ext2_free_blocks: bit already cleared for block 1192416

where the block and inode numbers were all the same. There were murmurs
of unrest in the logs here and there for the past few weeks (e.g. "NFS:
lookup failed, error=-116" and "nfs_revalidate_inode: /// getattr
failed, ino=65536, error=-116"), but I chalked them up to my (foolishly)
moving the exported directory without unexporting it, and then
recreating it from CVS, re-exporting, and remounting on the SCO and Sun
boxen several hours before the corruption occurred. Now that I look
more closely, though, after several "authenticated mount request from
..." messages I got several (in some cases, pages of)
"ext2_free_blocks: bit already cleared for block 734964" messages, where
the block number varied. These messages go back to about the time when I
started editing on the Linux box and building on the other boxen.
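
To get a rough picture of how these warnings are distributed, they can
be tallied by type. The following is only a sketch: the sample lines
are the ones quoted above, the category names and the tally() helper
are made up for illustration, and a real run would read the system log
instead (its path is not given here). Error 116 is ESTALE (stale NFS
file handle) on Linux, which is why those lines are grouped under
"stale_handle".

```python
import re
from collections import Counter

# Sample lines copied from the messages quoted above.
SAMPLE_LOG = """\
ext2_free_blocks: Freeing blocks not in datazone - block = 1025538412, count = 1
ext2_free_blocks: bit already cleared for block 1192428
ext2_free_inode: bit already cleared for inode 297275
lookup_by_inode: ino 297275 not found in src
find_fh_dentry: 03:05/297275 dir/297242 not found!
ext2_free_blocks: bit already cleared for block 1192416
NFS: lookup failed, error=-116
nfs_revalidate_inode: /// getattr failed, ino=65536, error=-116
"""

# Hypothetical category names; the regexes follow the quoted message text.
PATTERNS = {
    "not_in_datazone": re.compile(
        r"ext2_free_blocks: Freeing blocks not in datazone - "
        r"block = (\d+), count = (\d+)"
    ),
    "block_already_cleared": re.compile(
        r"ext2_free_blocks: bit already cleared for block (\d+)"
    ),
    "inode_already_cleared": re.compile(
        r"ext2_free_inode: bit already cleared for inode (\d+)"
    ),
    # 116 is ESTALE on Linux; the kernel logs it as a negative value.
    "stale_handle": re.compile(r"error=-116\b"),
}


def tally(lines):
    """Count matching lines per category and collect block/inode numbers."""
    counts = Counter()
    numbers = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if match.groups():
                    numbers[name].append(int(match.group(1)))
                break  # each line is counted in at most one category
    return counts, numbers


if __name__ == "__main__":
    counts, numbers = tally(SAMPLE_LOG.splitlines())
    for name, count in counts.items():
        print(f"{name}: {count} line(s), numbers seen: {numbers[name]}")
```

Lines that match none of the patterns (the lookup_by_inode and
find_fh_dentry messages here) are simply skipped; adding a pattern for
them would be straightforward.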

As this is our mail, DNS, and CVS machine I have not been so eager to
upgrade, but if no one has any other suggestions I will upgrade to
2.2.12 with knfsd-1.4.7 (the latest stable knfsd as far as I can tell)
and see if I have any more problems. If anyone wants to see the logs,
or any other information, or beat me soundly about the head with a clue
stick, please CC me.

Brian Strand
Switch Management
