Re: [PATCH] vfs: fix race in rcu lookup of pruned dentry

From: Linus Torvalds
Date: Sun Jul 17 2011 - 18:00:35 EST

On Sun, Jul 17, 2011 at 2:03 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> That -ENOENT in walk_component: isn't it assuming we found a negative
> dentry, before reaching the read_seqcount_retry which complete_walk
> (or nameidata_drop_rcu_last before 3.0) would use to confirm a successful
> lookup?

Hmm. I think you're right. The ENOENT will basically short-circuit the
full proper checks.

>  And can't memory pressure prune a dentry, coming to dentry_kill
> which __d_drops to unhash before dentry_iput resets d_inode to NULL, but
> the dentry_rcuwalk_barrier between those is ineffective if the other end
> ignores the seqcount?

Yes. However, looking at it, I'm not very happy with your patch. It
doesn't really make sense to me to special-case the NULL inode and
only do a seq_retry for that case.

I kind of see why you do it for that particular bug, but at the same
time, it just makes me go "Eww". If that inode isn't NULL yet, you
then return the dentry that can get a NULL d_inode later. So the only
special thing there is that we happen to check for a NULL inode there.
What protects *later* checks for a NULL d_inode?

So my gut feel is that we should instead

- either remove the -ENOENT return at that point entirely, and move
it to after we have re-verified the dentry lookup for other reasons.
That looks pretty involved, though, and those paths do end up
accessing inode data structures etc, so it looks less than trivial.


- simply just not clear d_inode at all in d_kill(), so that when we
prune a dentry due to memory pressure, it doesn't actually change the
state of the dentry.

and I think the second solution is the right one. It's kind of odd:
we'll have called down to the iput() routine, and the inode will be
"gone", but that is already true for the *normal* race of actually
deleting a file too, and we have that whole "inodes are RCU-released",
so the inode allocation will still exist.

So my gut feel is that we should instead just do this:

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -187,7 +187,6 @@ static void dentry_iput(struct dentry * dentry)
struct inode *inode = dentry->d_inode;
if (inode) {
- dentry->d_inode = NULL;

and see what the fall-out from that would be. Nobody should then *use*
the stale inode, because __d_drop has done that
dentry_rcuwalk_barrier(). So we avoid the NULL inode special case


The above (whitespace-damaged) patch may look trivial, but it is
*entirely* untested, and maybe my gut feel that the above is the right
way to solve the problem is just wrong.

Al, any reactions? Hugh, does the above patch work for your
stress-test case? Or, indeed, at all?

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at