Re: [PATCH] reconnect_one(): fix a missing error code

From: J. Bruce Fields
Date: Thu Jun 15 2017 - 17:40:17 EST


On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
> On Wed, Jun 14 2017, J. Bruce Fields wrote:
>
> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> >> NULL).
> >>
> >> We used to return an error pointer if lookup_one_len() failed but we
> >> moved this code into a helper function and accidentally removed that.
> >> NULL is a valid return for this function but it's not what we intended.
> >>
> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> >> Signed-off-by: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
> >
> > ACK. Agreed that the current code is wrong, and that this is the
> > correct fix.
> >
> > What I don't quite understand yet is what the impact of the bug would
> > be.
> >
>
> It is interesting that reconnect_path() handles the possibility of
> reconnect_one() returning NULL, even though it will only do that if this
> "bug" is triggered.

As Dan says, you're missing a case.

> When that happens, the target_dir (a descendent of dentry) gets its
> DCACHE_DISCONNECTED flag cleared.
>
> The bug can presumably only be triggered by a race.
> We look through a directory to find the name for an inode
> (exportfs_get_name), then try to look up that name and it doesn't exist.

Wouldn't lookup_one_len succesfully return a negative dentry in that
case?

I think the error cases here are more likely due to permissions or IO
errors.

So, I wonder if you can get some kind of dcache corruption with an
uncached lookup of a directory with an ancestor that we lack permission
to.

> So presumably if you lose the race, some dentry will get
> DCACHE_DISCONNECTED cleared, even though it is still disconnected.
> This breaks a contract and can cause weirdness in dcache operations.
>
> If the lookup_one_len_unlocked() fails, we should probably retry, at
> least once. But if we do decide to give up, we shouldn't assume it all
> worked.
>
> So I suggest:
> - the fix as provided by Dan, plus
> - remove "if (!parent) break;" from reconnect_path(), plus
> - maybe retry the get_name/lookup_one operation once if the first
> attempt fails.

See the comments in the code--if we lose the race, then it's because of
a concurrent operation which should have done the reconnection for us.

--b.