Re: Data corruption with NFS in 2.1.x?

Linus Torvalds (torvalds@transmeta.com)
17 Jan 1998 20:10:46 GMT


In article <34C0F145.23CC28FB@star.net>, Bill Hawes <whawes@star.net> wrote:
>Heinz Ulrich Stille wrote:
>
>> The file lengths are always the same. I just now thought of comparing versions
>> compiled under 2.0.x and 2.1.x, and noticed that the differences always occur in
>> places where the file from 2.0.x has zeroes; a typical result looks like this:
>> (from cmp -i 16 -l -c)
>>
>> bi-reverse.o:
>> -rw-r--r-- 1 root root 3240 Jan 16 23:40 bi-reverse.o
>> 2466 0 ^@ 331 M-Y
>> 2467 0 ^@ 30 ^X
>> 2468 0 ^@ 10 ^H
>
>Thanks, I'm making progress on tracking the problem down, and hopefully
>will have a patch for testing soon.

Bill and Heinz, one thing to look out for is programs that use
"[f]truncate()" and "lseek()" when they try to be clever about avoiding
large areas of zero blocks in files. Those kinds of programs tend to do
something like:

- notice that they are writing zeroes.
- instead of writing the zero area, do an "ftruncate()+lseek()" to the
first non-zero position (a rough sketch of this follows below).
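
Purely as illustration - this isn't from any particular program, and
write_sparse() is a name made up for this example - such a writer might
look roughly like this:

#include <unistd.h>
#include <sys/types.h>

/* Write "len" bytes from "buf" at the current offset, seeking over
 * runs of zeroes instead of writing them. */
ssize_t write_sparse(int fd, const char *buf, size_t len)
{
	size_t done = 0;
	int ends_in_hole = 0;

	while (done < len) {
		size_t run = 0;

		if (buf[done] == 0) {
			while (done + run < len && buf[done + run] == 0)
				run++;
			/* Skip the zeroes: the kernel leaves a hole, so
			 * no disk blocks get allocated for this range. */
			if (lseek(fd, (off_t)run, SEEK_CUR) == (off_t)-1)
				return -1;
			done += run;
			ends_in_hole = 1;
		} else {
			ssize_t w;

			while (done + run < len && buf[done + run] != 0)
				run++;
			w = write(fd, buf + done, run);
			if (w < 0)
				return -1;
			done += (size_t)w;
			ends_in_hole = 0;
		}
	}

	/* An lseek() past EOF doesn't move EOF by itself, so a file
	 * that ends in a hole has to be extended explicitly - this
	 * is where the ftruncate() comes in. */
	if (ends_in_hole) {
		off_t end = lseek(fd, 0, SEEK_CUR);
		if (end == (off_t)-1 || ftruncate(fd, end) != 0)
			return -1;
	}
	return (ssize_t)done;
}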

The reason they tend to do this is to avoid using up disk space: if you
actually do a write() of the zero area then the disk blocks will be
allocated, but if you just jump over the area then many filesystems will
not use any physical disk space at all for the zeroes, as long as they
are properly aligned etc.
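
You can see the effect by comparing the nominal size of a file with the
space actually allocated for it (the same thing "ls -l" vs "du" shows).
A minimal check, just for illustration:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	struct stat st;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &st) != 0) {
		perror("stat");
		return 1;
	}
	/* st_blocks counts 512-byte units; a sparse file shows far
	 * fewer allocated bytes than its nominal length. */
	printf("%s: %ld bytes long, %ld bytes allocated\n",
	       argv[1], (long)st.st_size, (long)st.st_blocks * 512L);
	return 0;
}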

In particular, the NFS client might have missed some clearing of the
page cache for the ftruncate() - or possibly there is a pending
asynchronous write() and the truncate gets re-ordered around it, or
something similar...
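
One quick way to poke at that hypothesis - purely a sketch, with an
arbitrary file name and arbitrary offsets - is to exercise the
ftruncate()+lseek() pattern on an NFS mount and check that the hole
reads back as zeroes:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n, i;
	int fd;

	fd = open("nfs-hole-test", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Fill the first 4k with non-zero data... */
	memset(buf, 0xaa, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		perror("write");
		return 1;
	}

	/* ...cut the file back, then jump past the old contents and
	 * write again, leaving a hole from offset 1024 to 8191. */
	if (ftruncate(fd, 1024) != 0) {
		perror("ftruncate");
		return 1;
	}
	if (lseek(fd, 8192, SEEK_SET) == (off_t)-1 || write(fd, "x", 1) != 1) {
		perror("lseek/write");
		return 1;
	}

	/* The hole must read back as zeroes; stale page-cache data
	 * left over from before the truncate would show up here. */
	if (lseek(fd, 1024, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		return 1;
	}
	n = read(fd, buf, sizeof(buf));
	for (i = 0; i < n; i++) {
		if (buf[i] != 0) {
			fprintf(stderr, "non-zero byte at offset %ld\n",
				1024L + (long)i);
			return 1;
		}
	}
	printf("hole reads back as zeroes\n");
	close(fd);
	return 0;
}

If the suspected re-ordering is what's happening, the non-zero bytes
should match the data that was in the file before the truncate.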

Linus