Re: regression, bisected: getcwd() ENOENT on NFS4...

From: Daniel J Blueman
Date: Fri Nov 13 2009 - 07:52:10 EST


Hi Trond,

On Fri, Nov 6, 2009 at 12:41 AM, Daniel J Blueman
<daniel.blueman@xxxxxxxxx> wrote:
> On Thu, Nov 5, 2009 at 5:45 PM, Trond Myklebust
> <trond.myklebust@xxxxxxxxxx> wrote:
>> On Wed, 2009-11-04 at 09:36 +0000, Daniel J Blueman wrote:
>>> On Sun, Nov 1, 2009 at 12:47 PM, Daniel J Blueman
>>> <daniel.blueman@xxxxxxxxx> wrote:
>>> > Hi Trond,
>>> >
>>> > On Mon, Oct 26, 2009 at 1:19 PM, Trond Myklebust
>>> > <Trond.Myklebust@xxxxxxxxxx> wrote:
>>> >> On Sun, 2009-10-25 at 23:31 +0000, Daniel J Blueman wrote:
>>> >>> Since 2.6.30-rc, I've been experiencing various issues relating to
>>> >>> getcwd() returning ENOENT on NFS4 clients. I used an over-complicated
>>> >>> but reliable reproducer [1] (on Karmic RC against a 2.6.32-rc5 NFS4
>>> >>> server) to bisect [2].
>>> >>>
>>> >>> The impact of this regression is moderate (side-effects range from
>>> >>> benign to failure), so we should get a fix into 2.6.32 if at all
>>> >>> possible and strongly consider a 2.6.31 stable update.
>>> >>>
>>> >>> Thanks,
>>> >>>   Daniel
>>> >>>
>>> >>> --- [1]
>>> >>>
>>> >>> $ apt-get source apt
>>> >>> $ cd apt-*
>>> >>> $ ./configure && make
>>> >>> [snip]
>>> >>> sh: getcwd() failed: No such file or directory
>>> >>>
>>> >>> --- [2]
>>> >>>
>>> >>> a65318bf3afc93ce49227e849d213799b072c5fd is first bad commit
>>> >>> commit a65318bf3afc93ce49227e849d213799b072c5fd
>>> >>> Author: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
>>> >>> Date:   Wed Mar 11 14:10:28 2009 -0400
>>> >>>
>>> >>>     NFSv4: Simplify some cache consistency post-op GETATTRs
>>> >>
>>> >> I'm having a lot of trouble seeing how this patch could result in
>>> >> ENOENT. All it should be doing is reducing the frequency with which we
>>> >> update some of the inode metadata.
>>> >>
>>> >> Have you ever been able to capture one of these errors using strace?
>>> >
>>> > Backing this patch out by hand against stock 2.6.32-rc5 (w/ 2.6.32-rc5
>>> > on server) corrects the behaviour. It's readily reproducible [1];
>>> > using 2.6.30, the issue is not seen, thus is a regression.
>>> >
>>> > To observe the change to user-level behaviour (after the reproducer commands):
>>> > # make clean
>>> > # strace -ffe getcwd make -n >list
>>> > [pid  3829] getcwd(0x7fffa269a380, 4096) = -1 ENOENT (No such file or directory)
>>> > make: getcwd: No such file or directory
>>> >
>>> > Would this help for me to log this via a bugzilla.kernel.org ticket?
>>> >
>>> > Thanks,
>>> >  Daniel
>>> >
>>> > --- [1]
>>> >
>>> > booting eg:
>>> > http://mira.sunsite.utk.edu/ubuntu-releases/karmic/ubuntu-9.10-desktop-amd64.iso
>>> >
>>> > $ sudo bash
>>> > # apt-get install build-essential
>>> > # apt-get build-dep apt
>>> > # mount server:/ /mnt -tnfs4 && cd /mnt
>>> > # apt-get source apt
>>> > # cd apt-0.7.23.1ubuntu2
>>> > # ./configure && make
>>> >  -> "getcwd: No such file or directory" messages observed with cited
>>> > patch and not without
>>>
>>> For continuity with the mailing list thread, I've created a bug report
>>> of this at:
>>>
>>> http://bugzilla.kernel.org/show_bug.cgi?id=14541
>>
>> I just committed the following patch into the above bugzilla entry. I
>> hope it suffices to fix the bug.
>>
>> Cheers
>>  Trond
>> -------------------------------------------------------------------
>> NFSv4: Fix a cache validation bug which causes getcwd() to return ENOENT
>> From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
>>
>> Changeset a65318bf3afc93ce49227e849d213799b072c5fd (NFSv4: Simplify some
>> cache consistency post-op GETATTRs) incorrectly changed the getattr
>> bitmap for readdir().
>> This causes the readdir() function to fail to return a
>> fileid/inode number, which again exposed a bug in the NFS readdir code that
>> causes spurious ENOENT errors to appear in applications (see
>> http://bugzilla.kernel.org/show_bug.cgi?id=14541).
>>
>> The immediate band aid is to revert the incorrect bitmap change, but more
>> long term, we should change the NFS readdir code to cope with the
>> fact that NFSv4 servers are not required to support fileids/inode numbers.
>>
>> Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
>> ---
>>
>>  fs/nfs/nfs4proc.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>>
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index ff37454..741a562 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -2767,7 +2767,7 @@ static int _nfs4_proc_readdir(struct dentry *dentry, struct rpc_cred *cred,
>>                .pages = &page,
>>                .pgbase = 0,
>>                .count = count,
>> -               .bitmask = NFS_SERVER(dentry->d_inode)->cache_consistency_bitmask,
>> +               .bitmask = NFS_SERVER(dentry->d_inode)->attr_bitmask,
>>        };
>>        struct nfs4_readdir_res res;
>>        struct rpc_message msg = {
>>
>
> This fixes the behaviour and passes some heavy testing with two good
> test-cases, with 2.6.32-rc6. As well, this would be good value for the
> stable stream. I've sync'd the bugzilla report.

Is there opportunity to get this regression fix into 2.6.32-rc8, since
-rc8 may be the (pen)ultimate before final?

Thanks,
Daniel
--
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/