Re: [NFS] blocks of zeros (NULLs) in NFS files in kernels >= 2.6.20

From: Chuck Lever
Date: Mon Sep 08 2008 - 17:16:17 EST


On Mon, Sep 8, 2008 at 3:02 PM, Aaron Straus <aaron@xxxxxxxxxxxxx> wrote:
> On Sep 05 03:56 PM, Chuck Lever wrote:
>> Comparing a wire trace with strace output, starting with the writing
>> client, might also be illuminating. We prefer wireshark as it uses
>> good default trace settings, parses the wire bytes and displays them
>> coherently, and allows you to sort the frames in various useful ways.
>
> OK, in addition to the bisection, I've collected trace data for the good
> (commit 4d770ccf4257b23a7ca2a85de1b1c22657b581d8) and bad (commit
> e261f51f25b98c213e0b3d7f2109b117d714f69d) cases.
>
> Attached is a file called trace.tar.bz2 inside you'll find 4 files, for
> the two sessions:
>
> bad-wireshark
> bad-strace
>
> good-wireshark
> good-strace
>
> From a quick glance, the difference seems to be that the bad case does
> an UNSTABLE NFS WRITE call. I don't really know what that means or what
> its semantics are... but that bad commit does seem to introduce this
> regression.

A little analysis reveals the bad behavior on the wire. I filtered
the network traces for just WRITE requests. There were six in the
good trace, and five in the bad. This seems to correspond with the
strace logs, which show that you stopped the bad run before it
generated more writes.

Good:

time:        offset:   len:    sync:       byte range:
 35.402734       0      8000   FILE_SYNC   [0, 7999]
100.405170    4096     11904   FILE_SYNC   [4096, 15999]
160.407822   12288      7712   FILE_SYNC   [12288, 19999]
195.411035   16384     11616   FILE_SYNC   [16384, 27999]
260.413574   24576     11424   FILE_SYNC   [24576, 35999]
295.704044   32768      7232   FILE_SYNC   [32768, 39999]

Bad:

time:        offset:   len:    sync:       byte range:
 36.989907       0      8000   FILE_SYNC   [0, 7999]
101.991164    4096     11904   FILE_SYNC   [4096, 15999]
161.992447   12288      7712   FILE_SYNC   [12288, 19999]
166.992203   20480      3520   UNSTABLE    [20480, 23999]
169.181642   16384      4096   FILE_SYNC   [16384, 20479]

The first thing I notice is that the client is re-writing areas of the
file. It seems to want to generate page-aligned WRITEs, but this
means it writes over bytes that were already sent to the server.
Probably not the issue here, but it's certainly not optimal behavior.
The strace logs show that the application generates 4000-byte write(2)
calls, with no interleaved operation that might change the seek offset.
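
If the client rounds the start of each flush down to a page boundary,
the overlap falls out of simple arithmetic (my reconstruction, not
anything produced by the trace tools):

    PAGE = 4096

    # After the first flush, bytes [0, 7999] are on the server, so the
    # first byte still dirty on the client is 8000.  Byte 8000 lives in
    # the page starting at offset 4096, so flushing whole pages resends
    # [4096, 7999] -- exactly what both traces show.
    first_dirty = 8000
    start = (first_dirty // PAGE) * PAGE
    print(start)   # 4096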

I might go so far as to say that this is incorrect behavior: multiple
clients writing to the same file at non-page-aligned offsets would
clobber each other's WRITEs.

The second thing is the UNSTABLE WRITE. The fact that it is UNSTABLE
is not the problem here. The byte range is what is most interesting:
it leaves a hole in the file at [20000, 20479], which the next write
(3 seconds later) fills in.
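
That hole is presumably what shows up as the blocks of zeros in the
subject line. A sketch of what a reader on a second client would see
during that 3-second window (assuming the file is visible to it at
that point):

    data = open("test-nfs", "rb").read()

    # [0, 19999] and [20480, 23999] hold real data; the region the
    # UNSTABLE write skipped reads back as NUL bytes until the final
    # WRITE backfills it.
    assert data[20000:20480] == b"\x00" * 480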

Normally this isn't a big deal, as NFS likes applications to serialize
access to shared files. However, in this case we expect the .flush()
to force all dirty bytes out at once, and this is definitely not
occurring....

The third thing is that I don't see any flush or fsync system calls in
the strace logs, even though the writer.py script invokes fp.flush().
I'm not terribly familiar with Python's flush(). How do you expect
the flush() method to work?
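
My guess is that flush() only drains the interpreter's userspace
buffer, so it shows up in strace as plain write(2) calls and never as
a separate syscall; pushing data all the way out would take an
explicit fsync. A minimal sketch of the distinction (the exact calls
in writer.py are an assumption on my part):

    import os

    fp = open("test-nfs", "w")
    fp.write("x" * 4000)

    fp.flush()             # drains Python's userspace buffer; appears in
                           # strace only as write(2), never as "flush"

    os.fsync(fp.fileno())  # what it would take to force the NFS client
                           # to push all dirty pages out to the server

    fp.close()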

I wonder why the client is using FILE_SYNC writes and not
UNSTABLE+COMMIT. Are the clients using the "sync" mount option? I
don't see O_SYNC on the open.

open("test-nfs", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3

I think I saw some recent work in Trond's development branch that
makes some changes in this area. I will wait for him to respond to
this thread.

--
Chuck Lever