Re: [PATCH] vfs: fix race in rcu lookup of pruned dentry

From: Hugh Dickins
Date: Sun Jul 17 2011 - 19:32:08 EST


On Sun, 17 Jul 2011, Linus Torvalds wrote:
> On Sun, Jul 17, 2011 at 2:03 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > That -ENOENT in walk_component: isn't it assuming we found a negative
> > dentry, before reaching the read_seqcount_retry which complete_walk
> > (or nameidata_drop_rcu_last before 3.0) would use to confirm a successful
> > lookup?
>
> Hmm. I think you're right. The ENOENT will basically short-circuit the
> full proper checks.
>
> >  And can't memory pressure prune a dentry, coming to dentry_kill
> > which __d_drops to unhash before dentry_iput resets d_inode to NULL, but
> > the dentry_rcuwalk_barrier between those is ineffective if the other end
> > ignores the seqcount?
>
> Yes. However, looking at it, I'm not very happy with your patch. It
> doesn't really make sense to me to special-case the NULL inode and
> only do a seq_retry for that case.
>
> I kind of see why you do it for that particular bug, but at the same
> time, it just makes me go "Eww". If that inode isn't NULL yet, you
> then return the dentry that can get a NULL d_inode later. So the only
> special thing there is that we happen to check for a NULL inode there.
> What protects *later* checks for a NULL d_inode?

I was imagining that all the later uses of the inode were using
walk_component()'s local struct inode *inode, or nd->inode which
it sets on success. Until complete_walk(), or the next level down
of lookup, has validated that stage by checking nd->seq.

If any does dereference dentry->d_inode in between, then it would
already be oopsing in this situation, which I've not seen.

>
> So my gut feel is that we should instead
>
> - either remove the -ENOENT return at that point entirely, and move
> it to after we have re-verified the dentry lookup for other reasons.
> That looks pretty involved, though, and those paths do end up
> accessing inode data structures etc, so it looks less than trivial.
>
> OR
>
> - simply just not clear d_inode at all in d_kill(), so that when we
> prune a dentry due to memory pressure, it doesn't actually change the
> state of the dentry.

But whether my imagining was right or wrong, your -1 line patch
looks a much nicer solution.

>
> and I think the second solution is the right one. It's kind of odd:
> we'll have called down to the iput() routine, and the inode will be
> "gone", but that is already true for the *normal* race of actually
> deleting a file too, and we have that whole "inodes are RCU-released",
> so the inode allocation will still exist.

Yes, the inodes are RCU-released too, so that side of it fits okay.

>
> So my gut feel is that we should instead just do this:
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -187,7 +187,6 @@ static void dentry_iput(struct dentry * dentry)
> {
> struct inode *inode = dentry->d_inode;
> if (inode) {
> - dentry->d_inode = NULL;
> list_del_init(&dentry->d_alias);
> spin_unlock(&dentry->d_lock);
> spin_unlock(&inode->i_lock);
>
> and see what the fall-out from that would be. Nobody should then *use*
> the stale inode, because __d_drop has done that
> dentry_rcuwalk_barrier(). So we avoid the NULL inode special case
> entirely.
>
> Comments?

At first it looked worrying to interfere with the sequence
"inode = dentry->d_inode; if (inode) { dentry->d_inode = NULL;"
but seeing as dentry_iput() is only called from the one place,
I think the test is merely about negative dentries, and setting
d_inode to NULL is just long-standing tidiness, nothing vital.

>
> The above (whitespace-damaged) patch may look trivial, but it is
> *entirely* untested, and maybe my gut feel that the above is the right
> way to solve the problem is just wrong.
>
> Al, any reactions? Hugh, does the above patch work for your
> stress-test case? Or, indeed, at all?

Well, my stress tests don't blow up in the first half hour.
But I was not successful in re-reproducing the issue (I got
yesterday in 2.5 hours) within 7 hours, so it's probably going
to take days to be sure that your minus-1-liner fixes it.

I certainly like your patch, though I'm not quite as confident
with it as I was with mine (until you pointed out its inconsistency).
Just a nagging doubt that leaving d_inode set may be wrong for somewhere.

Hugh