Re: [PATCH] vfs: fix race in rcu lookup of pruned dentry

From: Hugh Dickins
Date: Mon Jul 18 2011 - 18:43:56 EST


On Mon, 18 Jul 2011, Linus Torvalds wrote:
> On Mon, Jul 18, 2011 at 2:19 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > __d_lookup_rcu() is being careful about *inode, yes.
> >
> > But I'd forgotten it was even setting it: doesn't that setting get
> > overridden later by the more careless *inode = path->d_entry->d_inode
> > at the head of __follow_mount_rcu()'s loop?
> >
> > Perhaps that line just needs to be moved to the tail of the loop?
>
> Ahh. Bingo. Yes, I think you found it.
>
> I don't think it should touch that *inode value in
> __follow_mount_rcu() unless we actually followed a mount, exactly
> because it will overwrite the thing that we were so careful about in
> __d_lookup_rcu().
>
> So how about this patch that replaces the earlier mount-point sequence
> number one. The only difference is (as you mention) to just do the
> *inode update at the end of the loop, so that we don't overwrite the
> valid inode data with a non-checked one when we don't do anything.
>
> Untested. But this should make my propised change to fs/dcache.c be
> irrelevant, because whether we clear d_inode or not, the existing
> sequence number checks will catch it. Agreed?

Yes, that looks like THE right fix to me, making others irrelevant.
But I haven't tested it either, won't begin to do so until tonight.
And I fear I won't notice if there's any silly little bug in there,
which forces every fast RCU lookup to go the slow way instead!

Inlined for the benefit of wider scrutiny:

fs/namei.c | 21 +++++++++++++++++++--
1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5c867dd1c0b3..b9cd558a00be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -938,11 +938,12 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
{
for (;;) {
struct vfsmount *mounted;
+ unsigned int seq;
+
/*
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
*/
- *inode = path->dentry->d_inode;
if (unlikely(managed_dentry_might_block(path->dentry)))
return false;

@@ -952,9 +953,25 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry, 1);
if (!mounted)
break;
+ seq = read_seqcount_begin(&mounted->mnt_root->d_seq);
+
+ /*
+ * The memory barrier in read_seqcount_begin() is sufficient,
+ * so we can use __read_seqcount_retry() to check the prev
+ * sequence numbers.
+ */
+ if (!__read_seqcount_retry(&path->dentry->d_seq, nd->seq))
+ return false;
path->mnt = mounted;
path->dentry = mounted->mnt_root;
- nd->seq = read_seqcount_begin(&path->dentry->d_seq);
+ nd->seq = seq;
+
+ /*
+ * Update the inode too. We don't need to re-check the
+ * dentry sequence number here after this d_inode read,
+ * because a mount-point is always pinned.
+ */
+ *inode = path->dentry->d_inode;
}
return true;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/