Re: fs: NULL deref in atime_needs_update

From: Dmitry Vyukov
Date: Wed Feb 24 2016 - 05:03:45 EST


On Wed, Feb 24, 2016 at 5:46 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
>> On Wed, 2016-02-17 at 00:40 +0100, MickaÃl SalaÃn wrote:
>> > Hi,
>> >
>> > Actually I found the same bug (without fuzzing) and I can reproduce it
>> > in a deterministic way (e.g. by creating a LSM that return 1 for the
>> > security_file_open hook). At least, from v4.2.8 I can easily trigger
>> > traces like this :
>>
>> Reading through this thread I wonder if this is a new problem.
>>
>> Is there a previous kernel it can't be reproduced on?
>> Perhaps a bisect will shed some light on what's happening.
>
> There are several things in the mix. What MickaÃl has found is that a bunch
> of places where _positive_ number returned instead of expected 0 or -E... can
> get propagated all way back to caller of do_last(), confusing the hell out
> of it. That's not what Dmitry has triggered, though. WARN_ON() in the
> end of do_last() would've triggered, and IMO this one, along with mitigation
> (map that "error value" to -EINVAL) should go into mainline and all -stable.
> Bogus ->open() returning a positive number had always been bad news; in the
> best case it would be returned to userland, leading to "open(2) has failed
> and returned a positive number". Hell knows if we ever had such instances
> (or have them right now), but I wouldn't bet on their absense. Rare
> failure exits returning bogus values in an ->open() instance in some driver
> can easily stay unnoticed for a long time. Starting from at least 3.6
> (circa the atomic_open support) it got more unpleasant than simple "confuse
> the hell out of userland". ->open() isn't the only vector for injection of
> such crap - ->permission() would also serve, same for several LSM turds, etc.
>
> Again, that's a separate problem. What Dmitry seems to be catching is getting
> crap values fed to should_follow_link() as inode. I see one bug in that
> area that does need fixing (fix present in the last patch I've posted, with
> WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
> be gone from the final variant, since this can trigger without driver bugs,
> etc.) In RCU mode after we'd checked that dentry is a symlink one, we need
> to verify that it hadn't been changed since before we'd fetched the inode.
> It might have been e.g. a regular file, which got unlinked with symlink
> created in its place. Then we'd go into get_link() with non-symlink inode
> and oops on attempt to call its ->i_op->get_link(). That race is real, very
> hard to hit (you need both the unlink(2) and symlink(2) to happen between
> lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
> can't lose the timeslice there) and would've manifested differently.
>
> But that leaves other two kinds of bugs getting triggered: inode of some
> non-symlink is possible, but what we saw included NULL inode when we'd
> reached finish_open: in do_last(). Should be flat-out impossible - we either
> get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
> get NULL inode from pinned d_is_symlink() dentry after having grabbed
> a reference and left RCU mode. Neither should be possible without either
> something weird happening to lookup_fast() (and we would've seen oopsen in
> link_path_walk() if that could happen, BTW) or screwed dentry refcounting
> somewhere, combined with a race that managed to turn...
>
> Oh, shit. No screwed refcounting is needed. Look:
> BUG_ON(nd->flags & LOOKUP_RCU);
> inode = d_backing_inode(path.dentry);
> seq = 0; /* out of RCU mode, so the value doesn't matter */
> if (unlikely(d_is_negative(path.dentry))) {
> path_to_nameidata(&path, nd);
> return -ENOENT;
> }
> Suppose we come here with negative path.dentry. We are holding a reference,
> all right, and for a _postive_ dentry that would've been enough to keep
> it positive. Not so for a negative one, though - symlink(2) on another
> CPU doint d_instantiate() just before the d_is_negative() check and we
> are fucked - inode is stale and we sail through all the checks, all the
> way into should_follow_link().
>
> We also have the same kind of crap in walk_component() -
> err = lookup_slow(nd, &path);
> if (err < 0)
> return err;
> inode = d_backing_inode(path.dentry);
> seq = 0; /* we are already out of RCU mode */
> err = -ENOENT;
> if (d_is_negative(path.dentry))
> goto out_path_put;
> There it's much harder to hit, though - we need it not just d_instantiate()
> overlapping those lines; we need the racing syscall to get from locking
> the parent to d_instantiate() between the point where lookup_slow() has
> unlocked the parent and d_is_negative(). And lookup_slow() couldn't have
> gone into mountpoint crossing, so it's really hard to hit - you pretty
> much have to get preempted just after fetching inode.
>
> OK, the next delta to try, and there definitely are several commits in
> that pile. It still does not explain the traces with GPF at 0x50, though -
> for that we need not just a NULL getting to should_follow_link() but
> something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
> struct inode). That something *can't* be a valid struct inode or had been
> one in recent past - ->i_sb is assigned in new_inode(), value is always
> non-NULL and never modified all the way until RCU-delayed freeing of struct
> inode. So that has to be something entirely different... Anyway, the
> patch so far follows:


Restarted testing with this patch (dropped previous patches because
they conflict).

These "unlikely" scenarios can be more likely inside of VMs where
effective preemption can happen at random points. Also NMIs probably
can increase probability of such races.