Re: [PATCH] vfs: fix race in rcu lookup of pruned dentry

From: Linus Torvalds
Date: Mon Jul 18 2011 - 14:12:32 EST

On Sun, Jul 17, 2011 at 11:31 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> Now, I do agree that maybe that case simply should check the dentry
> sequence count. I wish all cases did. Hugh patch did that. But the
> reason I dislike Hugh's patch is that when I say "I wish they all
> did", I mean that I dislike the special casing. And Hugh's patch just
> adds *more* special casing for that NULL entry - I'd wish we just
> always did it regardless of whether it was NULL or not.

Btw, looking at that, I think Hugh's patch is wrong. It does

if (!read_seqcount_retry(&dentry->d_seq, nd->seq))

but that's after we've done the __follow_mount_rcu() that may actually
have changed "nd->seq" to the mount-point inode (and has changed
path->dentry to match it).

Now, it only does it if inode is NULL, so I guess it doesn't matter,
but it's the kind of inconsistency that I think is really dangerous,
because it basically compares incompatible sequence numbers.

Also, looking at that whole mount-point traversal sequence, it looks
like __follow_mount_rcu() will happily totally ignore the old sequence
number when it replaces it with the mount-point sequence number. So it
looks to me that we have a case where we miss the sequence number
check that can happen with a positive dentry too!


So I think that whenever we change "nd->seq", we should always heck
the previous sequence number first (the way do_lookup() itself does
for the *normal* traversal case). Otherwise we will have traversed the
mount-point without ever having checked the previous sequence number.

Something like the (untested) attached patch.

Comments? This mount-point case is independent of the negative dentry
issue, and probably never really an issue in practice, but...

fs/namei.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5c867dd1c0b3..f78d2af3aef4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -938,6 +938,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
for (;;) {
struct vfsmount *mounted;
+ unsigned int seq;
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
@@ -952,9 +954,18 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry, 1);
if (!mounted)
+ seq = read_seqcount_begin(&mounted->mnt_root->d_seq);
+ /*
+ * The memory barrier in read_seqcount_begin() is sufficient,
+ * so we can use __read_seqcount_retry() to check the prev
+ * sequence numbers.
+ */
+ if (!__read_seqcount_retry(&path->dentry->d_seq, nd->seq))
+ return false;
path->mnt = mounted;
path->dentry = mounted->mnt_root;
- nd->seq = read_seqcount_begin(&path->dentry->d_seq);
+ nd->seq = seq;
return true;