Re: [RFC][PATCHSET][CFT] pathwalk cleanups and fixes

From: Ian Kent
Date: Sun Jan 19 2020 - 09:34:12 EST


On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
> OK, vfs.git #work.namei seems to survive xfstests. I think
> it cleans the things quite a bit, but it obviously needs more
> review and testing.
>
> Review and testing would be _very_ welcome; it does a lot
> of massage, so there had been a plenty of opportunities to fuck up
> and fail to spot that. The same goes for profiling - it doesn't
> seem to slow the things down, but that needs to be verified.

I have run my usual tests (the second run of my submount-test is still
going) and they have run through fine.

I spend what time I can looking through the series tomorrow but will
probably need to complete that when I return from my trip to Albany
(Western Australia) some time on Friday.

>
> It does include #work.openat2. Topology: 17 commits, followed
> by clean merge with #work.openat2, followed by 9 followups. The
> part is #work.openat2 is as posted by Aleksa; I can repost it, but
> I don't see much point. Description of the rest follows; patches
> themselves will be in followups.
>
> part 1: follow_automount() cleanups and fixes.
>
> Quite a bit of that function had been about working around the
> wrong calling conventions of finish_automount(). The problem is that
> finish_automount() misuses the primitive intended for mount(2) and
> friends, where we want to mount on top of the pile, even if something
> has managed to add to that while we'd been trying to lock the
> namespace.
> For automount that's not the right thing to do - there we want to
> discard
> whatever it was going to attach and just cross into what got mounted
> there in the meanwhile (most likely - the results of the same
> automount
> triggered by somebody else). Current mainline kinda-sorta manages to
> do
> that, but it's unreliable and very convoluted. Much simpler approach
> is to stop using lock_mount() in finish_automount() and have it bail
> out if something turns out to have been mounted on top where we
> wanted
> to attach. That allows to get rid of a lot of PITA in the caller.
> Another simplification comes from not trying to cross into the
> results
> of automount - simply ride through the next iteration of the loop and
> let it move into overmount.
>
> Another thing in the same series is divorcing
> follow_automount()
> from nameidata; that'll play later when we get to unifying
> follow_down()
> with the guts of follow_managed().
>
> 4 commits, the second one fixes a hard-to-hit race. The first
> is a prereq for it.
>
> 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers
> 2/17 fix automount/automount race properly
> 3/17 follow_automount(): get rid of dead^Wstillborn code
> 4/17 follow_automount() doesn't need the entire nameidata
>
> part 2: unifying mount traversals in pathwalk.
>
> Handling of mount traversal (follow_managed()) is currently
> called
> in a bunch of places. Each of them is shortly followed by a call of
> step_into() or an open-coded equivalent thereof. However, the
> locations
> of those step_into() calls are far from preceding follow_managed();
> moreover, that preceding call might happen on different paths that
> converge to given step_into() call. It's harder to analyse that it
> should
> be (especially when it comes to liveness analysis) and it forces
> rather
> ugly calling conventions on
> lookup_fast()/atomic_open()/lookup_open().
> The series below massages the code to the point when the calls of
> follow_managed() (and __follow_mount_rcu()) move into the beginning
> of
> step_into().
>
> 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying
> O_NOFOLLOW
> gets EEXIST handling in do_last() past the step_into() call
> there.
> 6/17 handle_mounts(): start building a sane wrapper for
> follow_managed()
> rather than mangling follow_managed() itself (and creating
> conflicts
> with openat2 series), add a wrapper that will absorb the
> required
> interface changes.
> 7/17 atomic_open(): saner calling conventions (return dentry on
> success)
> struct path passed to it is pure out parameter; only dentry
> part
> ever varies, though - mnt is always nd->path.mnt. Just return
> the dentry on success, and ERR_PTR(-E...) on failure.
> 8/17 lookup_open(): saner calling conventions (return dentry on
> success)
> propagate the same change one level up the call chain.
> 9/17 do_last(): collapse the call of path_to_nameidata()
> struct path filled in lookup_open() call is eventually given to
> handle_mounts(); the only use it has before that is
> path_to_nameidata()
> call in "->atomic_open() has actually opened it" case, and
> there
> path_to_nameidata() is an overkill - we are guaranteed to
> replace
> only nd->path.dentry. So have the struct path filled only
> immediately
> prior to handle_mounts().
> 10/17 handle_mounts(): pass dentry in, turn path into a pure out
> argument
> now all callers of handle_mount() are directly preceded by
> filling
> struct path it gets. path->mnt is nd->path.mnt in all cases,
> so we can
> pass just the dentry instead and fill path in handle_mount()
> itself.
> Some boilerplate gone, path is pure out argument of
> handle_mount()
> now.
> 11/17 lookup_fast(): consolidate the RCU success case
> massage to gather what will become an RCU case equivalent of
> handle_mounts(); basically, that's what we do if revalidate
> succeeds
> in RCU case of lookup_fast(), including unlazy and fallback to
> handle_mounts() if __follow_mount_rcu() says "it's too tricky".
> 12/17 teach handle_mounts() to handle RCU mode
> ... and take that into handle_mount() itself. The other caller
> of
> __follow_mount_rcu() is fine with the same fallback (it just
> didn't
> bother since it's in the very beginning of pathwalk), switched
> to
> handle_mount() as well.
> 13/17 lookup_fast(): take mount traversal into callers
> Now we are getting somewhere - both RCU and non-RCU success
> cases of
> lookup_fast() are ended with the same return
> handle_mounts(...);
> move that to the callers - there it will merge with the
> identical calls
> that had been on the paths where we had to do slow lookups.
> lookup_fast() returns dentry now.
> 14/17 new step_into() flag: WALK_NOFOLLOW
> use step_into() instead of open-coding it in
> handle_lookup_down().
> Add a flag for "don't follow symlinks regardless of
> LOOKUP_FOLLOW" for
> that (and eventually, I hope, for .. handling).
> Now *all* calls of handle_mounts() and step_into() are right
> next to
> each other.
> 15/17 fold handle_mounts() into step_into()
> ... and we can move the call of handle_mounts() into
> step_into(),
> getting a slightly saner calling conventions out of that.
> 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into
> path_lookupat()
> another payoff from 14/17 - we can teach path_lookupat() to do
> what path_mountpointat() used to. And kill the latter, along
> with
> its wrappers.
> 17/17 expand the only remaining call of path_lookup_conditional()
> minor cleanup - RIP path_lookup_conditional(). Only one caller
> left.
>
> At that point we run out of things that can be done without textual
> conflicts
> with openat2 series. Changes so far:
> * mount traversal is taken into step_into().
> * lookup_fast(), atomic_open() and lookup_open() calling
> conventions
> are slightly changed. All of them return dentry now, instead of
> returning
> an int and filling struct path on success. For lookup_fast() the old
> "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for
> cache
> miss, dentry - for hit".
> * step_into() can be called in RCU mode as well. Takes
> nameidata,
> WALK_... flags, dentry and, in RCU case, corresponding inode and seq
> value.
> Handles mount traversals, decides whether it's a symlink to be
> followed.
> Error => returns -E...; symlink to follow => returns 1, puts symlink
> on stack;
> non-symlink or symlink not to follow => returns 0, moves nd->path to
> new location.
> * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and
> friends
> became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in
> flags.
>
> Next comes the merge with Aleksa's openat2 patchset; everything up to
> that point
> had been non-conflicting with it. That patchset has been posted
> earlier;
> it's in #work.openat2. The next series comes on top of the merge.
>
> part 3: untangling the symlink handling.
>
> Right now when we decide to follow a symlink it happens this
> way:
> * step_into() decides that it has been given a symlink that
> needs to
> be followed.
> * it calls pick_link(), which pushes the symlink on stack and
> returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
> stored on stack and the inode is stashed in nd->link_inode.
> * step_into() passes that 1 to its callers, which proceed to
> pass it
> up the call chain for several layers. In all cases we get to
> get_link()
> call shortly afterwards.
> * get_link() is called, picks the inode stashed in nd-
> >link_inode
> by the pick_link(), does some checks, touches the atime, etc.
> * get_link() either picks the link body out of inode or calls
> ->get_link(). If it's an absolute symlink, we move to the root and
> return
> the relative portion of the body; if it's a relative one - just
> return the
> body. If it's a procfs-style one, the call of nd_jump_link() has
> been
> made and we'd moved to whatever location is desired. And return
> NULL,
> same as we do for symlink to "/".
> * the caller proceeds to deal with the string returned to it.
>
> The sequence is the same in all cases (nested symlink, trailing
> symlink on lookup, trailing symlink on open), but its pieces are not
> close
> to each other and the bit between the call of pick_link() and
> (inevitable)
> call of get_link() afterwards is not easy to follow. Moreover, a
> bunch
> of functions (walk_component/lookup_last/do_last) ends up with the
> same
> conventions for return values as step_into(). And those conventions
> (see above) are not pretty - 0/1/-E... is asking for mistakes,
> especially
> when returned 1 is used only to direct control flow on a rather
> twisted
> way to matching get_link() call. And that path can be seriously
> twisted.
> E.g. when we are trying to open /dev/stdin, we get the following
> sequence:
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast(). Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing. Into the stack it goes and we return 1.
> * do_last() sees 1 and returns it.
> * trailing_symlink() is called (in the top-level loop) and it
> calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
> root again and return "proc/self/fd/0".
> * link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now). It's dropped (from step_into()) and we
> get
> to look at what we'd been given. A symlink to follow, so on the
> stack
> it goes and we return 1.
> * again, do_last() passes 1 to caller
> * trailing_symlink() is called and calls get_link().
> * this time it's a procfs symlink and its ->get_link() method
> moves us to the mount/dentry of our stdin. And returns NULL. But
> the
> fun doesn't stop yet.
> * trailing_symlink() returns "" to the caller
> * link_path_walk() is called on that and does nothing
> whatsoever.
> * do_last() is called and sees LAST_BIND left by the
> get_link().
> It calls handle_dots()
> * handle_dots() drops the symlink from stack and returns
> * do_last() *FINALLY* proceeds to the point after its call of
> step_into() (finish_open:) and gets around to opening the damn thing.
>
> Making sense of the control flow through all of that is not
> fun,
> to put it mildly; debugging anything in that area can be a massive
> PITA,
> and this example has touched only one of 3 cases. Arguably, the
> worst
> one, but... Anyway, it turns out that this code can be massaged to
> considerably saner shape - both in terms of control flow and wrt
> calling
> conventions.
>
> 1/9 merging pick_link() with get_link(), part 1
> prep work: move the "hardening" crap from trailing_symlink()
> into
> get_link() (conditional on the absense of LOOKUP_PARENT in nd-
> >flags).
> We'll be moving the calls of get_link() around quite a bit through
> that
> series, and the next step will be to eliminate trailing_symlink().
> 2/9 merging pick_link() with get_link(), part 2
> fold trailing_symlink() into lookup_last() and do_last().
> Now these are returning strings; it's not the final calling
> conventions,
> but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...)
> =>
> old -E..., we'd failed. string => old 1, and the string is the
> symlink
> body to follow. Just as for trailing_symlink(), "/" and procfs ones
> (where get_link() returns NULL) yield "", so the ugly song and dance
> with no-op trip through link_path_walk()/handle_dots() still remains.
> 3/9 merging pick_link() with get_link(), part 3
> elimination of that round-trip. In *all* cases having
> get_link() return NULL on such symlinks means that we'll proceed to
> drop the symlink from stack and get back to the point near that
> get_link() call - basically, where we would be if it hadn't been
> a symlink at all. The path by which we are getting there depends
> upon the call site; the end result is the same in all cases - such
> symlinks (procfs ones and symlink to "/") are fully processed by
> the time get_link() returns, so we could as well drop them from the
> stack right in get_link(). Makes life simpler in terms of control
> flow analysis...
> And now the calling conventions for do_last() and lookup_last()
> have reached the final shape - ERR_PTR(-E...) for error, NULL for
> "we are done", string for "traverse this".
> 4/9 merging pick_link() with get_link(), part 4
> now all calls of walk_component() are followed by the same
> boilerplate - "if it has returned 1, call get_link() and if that
> has returned NULL treat that as if walk_component() has returned 0".
> Eliminate by folding that into walk_component() itself. Now
> walk_component() return value conventions have joined those of
> do_last()/lookup_last().
> 5/9 merging pick_link() with get_link(), part 5
> same as for the previous, only this time the boilerplate
> migrates one level down, into step_into(). Only one caller of
> get_link() left, step_into() has joined the same return value
> conventions.
> 6/9 merging pick_link() with get_link(), part 6
> move that thing into pick_link(). Now all traces of
> "return 1 if we are following a symlink" are gone.
> 7/9 finally fold get_link() into pick_link()
> ta-da - expand get_link() into the only caller. As a side
> benefit, we get rid of stashing the inode in nd->link_inode - it
> was done only to carry that piece of information from pick_link()
> to eventual get_link(). That's not the main benefit, though - the
> control flow became considerably easier to reason about.
>
> For what it's worth, the example above (/dev/stdin) becomes
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast(). Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing. On the stack it goes and we get its body. Which is
> "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
> * do_last() sees non-NULL and returns it - whether it's an
> error
> or a pathname to traverse, we hadn't reached something we'll be
> opening.
> * link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now). It's dropped (from step_into()) and we
> get
> to look at what we'd been given. A symlink to follow, so on the
> stack
> it goes. This time it's a procfs symlink and its ->get_link()
> method
> moves us to the mount/dentry of our stdin. And returns NULL. So we
> drop symlink from stack and return that NULL to caller.
> * that NULL is returned by step_into(), same as if we had just
> moved to a non-symlink.
> * do_last() proceeds to open the damn thing.
>
> part 4. some mount traversal cleanups.
>
> 8/9 massage __follow_mount_rcu() a bit
> make it more similar to non-RCU counterpart
> 9/9 new helper: traverse_mounts()
> the guts of follow_managed() are very similar to
> follow_down(). The calling conventions are different
> (follow_managed()
> works with nameidata, follow_down() - with standalone struct path),
> but the core loop is pretty much the same in both. Turned that loop
> into a common helper (traverse_mounts()) and since follow_managed()
> becomes a very thin wrapper around it, expand follow_managed() at its
> only call site (in handle_mounts()),
>
> That's where the series stands right now. FWIW, at 5.5-rc1
> fs/namei.c
> had been 4867 lines, at the tip of #work.openat2 - 4998, at the
> tip of #work.namei (containing #work.openat2) - 4730... And IMO
> the thing has become considerably easier to follow.
>
> What's more, it might be possible to untangle the control flow in
> do_last() now. Probably a separate series, though - do_last() is
> one hell of a tarpit, so I'm not stepping into it for the rest
> of this cycle...
>