Re: [PATCH] vfs: fix race in rcu lookup of pruned dentry

From: Linus Torvalds
Date: Mon Jul 18 2011 - 16:25:16 EST


On Mon, Jul 18, 2011 at 12:47 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> Huh?  We do __d_drop() in there, and do that before we start messing
> with ->d_inode.

Hmm. Yes, looking at it, the ordering all seems correct. But then what
did Hugh see at all?

The inode thing he got from d_inode is re-verified by
__d_lookup_rcu(). So if inode is NULL, that means that the other CPU
has done dentry_iput(), which means that __d_drop has already
happened, which means that the dentry has been removed from the hash
list *and* the count has been incremented.

So just judging by Hugh's thing, something is wrong there. Some
missing barrier, perhaps. But write_seqcount_barrier() does seem to
have the write barrier, and the __d_lookup_rcu() read barrier check is
using the proper "full" read barrier too.

So in order for __d_lookup_rcu() to return a NULL inode, we have the
following requirements:
- it needs to find the dentry (duh)
- the dentry sequence count gets read (proper read barrier after this)
- the dentry must still look hashed
- the dentry->d_inode must be NULL
- and the dentry sequence count gets re-read (proper read barrier
before this) and must match.

And right now I don't see how that can happen, exactly because
__d_lookup_rcu does the sequence point check. But that's what Hugh's
patch depends on: seeing a NULL d_inode race.

If we see d_inode being NULL, that means that the first sequence
number read must happen *after* we've set d_inode to NULL in
dentry_iput(), which happens *after* we've done the sequence number
increment. That part is fine: that means that the sequence numbers
will match (because both of them are the "later" one). No
inconsistency so far. d_inode being NULL is perfectly compatible with
no sequence number change, and that was what I thought was a bug in
this area at first.

But if the first sequence number read (the read_seqcount_begin() in
__d_lookup_rcu()) sees the later sequence number, that means that
__d_drop has already happened, and it should *also* see the
dentry->d_hash.pprev = NULL; that happened in __d_drop before the
dentry_rcuwalk_barrier(). We have both the read barrier in the
read_seqcount_begin() and the write barrier in the
dentry_rcuwalk_barrier() to guarantee that.

But if the __d_lookup_rcu() sees that, then the d_unhashed() should
have seen the NULL pprev pointer, and not ever returned the dentry,
because that __d_lookup_rcu() loop does

if (d_unhashed(dentry))
continue;

after reading the sequence count.

So how could Hugh's NULL inode ever happen in the first place? Even
with the current sources? It all looks solid to me now that I look at
all the details.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/