Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: Linus Torvalds
Date: Fri May 15 2015 - 22:23:29 EST


On Fri, May 15, 2015 at 6:55 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> See upthread. It might be doable (provided that we turn ->i_mutex into
> rwsem, to keep the exclusion with directory _modifiers_), but it'll need
> a really non-trivial code review of a bunch of filesystems, especially ones
> that want to play with the list of children like ceph does. And things
> like sillyrename and dcache-populating readdir instances, albeit not as scary
> as ceph. And then there's lustre...

Yup.

I don't think it's viable if we can't do it gradually, and leave
filesystems with the option to basically keep the existing locking.
Because most won't care that deeply anyway, and some have
complications like ceph.

But we might be able to do *some* changes that wouldn't be that
noticeable. For example, something like

- phase 1:

Turn i_mutex into an rwsem, and change all users to take it for writing.

This part should be pretty much a semantic no-op.

- phase 2:

For filesystems that say that they are ok with it, make lookup_slow()
(and *only* lookup_slow() for now) take the rwsem for reading instead,
and in addition to that, take a hashed mutex.

By "hashed mutex", I mean having a smallish table of mutexes (say,
1024), and just creating a hash based on the name-hash and the parent
pointer. That way we can avoid all the issues with adding a new lock
to the dentry itself, or having to allocate a new child dentry just
for the lock. It *could* cause some cross-directory serialization due
to hash collisions, but that shouldn't be noticeable if the hash is of
a reasonable size and quality.
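
To make the "hashed mutex" idea concrete, here is a rough userspace sketch
using pthreads rather than kernel primitives. All the names
(lookup_mutex_for() and friends) are made up for illustration, and the mixing
function is just one plausible choice; only the table size of 1024 and the
"name-hash plus parent pointer" key come from the description above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* 1024 entries, as suggested in the text; all names here are hypothetical. */
#define LOOKUP_HASH_BITS 10
#define LOOKUP_HASH_SIZE (1u << LOOKUP_HASH_BITS)

static_assert((LOOKUP_HASH_SIZE & (LOOKUP_HASH_SIZE - 1)) == 0,
              "table size must be a power of two");

static pthread_mutex_t lookup_mutexes[LOOKUP_HASH_SIZE];
static pthread_once_t lookup_mutexes_once = PTHREAD_ONCE_INIT;

static void lookup_mutexes_init(void)
{
    for (unsigned i = 0; i < LOOKUP_HASH_SIZE; i++)
        pthread_mutex_init(&lookup_mutexes[i], NULL);
}

/*
 * Mix the parent pointer with the name hash and keep the top
 * LOOKUP_HASH_BITS bits of a 64-bit multiplicative hash.  The constant is
 * an assumption (a common golden-ratio multiplier), not anything taken
 * from the kernel.
 */
static unsigned lookup_mutex_index(const void *parent, uint32_t name_hash)
{
    uint64_t h = (uint64_t)(uintptr_t)parent;

    h ^= (uint64_t)name_hash * 0x9E3779B97F4A7C15ull;
    h *= 0x9E3779B97F4A7C15ull;
    return (unsigned)(h >> (64 - LOOKUP_HASH_BITS));
}

/* Same (parent, name_hash) pair always maps to the same mutex. */
static pthread_mutex_t *lookup_mutex_for(const void *parent, uint32_t name_hash)
{
    pthread_once(&lookup_mutexes_once, lookup_mutexes_init);
    return &lookup_mutexes[lookup_mutex_index(parent, name_hash)];
}
```

Nothing is stored in the dentry and nothing is allocated per lookup. The
cost is that two unrelated (parent, name) pairs can land on the same slot and
serialize against each other, which is the cross-directory collision case
mentioned above.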

That would allow lookups (and _only_ lookups) to happen in parallel,
but the hashed mutex would mean that you'd serialize the "same name in
same directory" case. And we'd require filesystems to say "I can
support this concurrent lookup model".

There might be a "phase 3" and so on where we could expand this to
slightly more than just lookup_slow(), but I suspect that even doing
it *just* there would already catch the bulk of issues. And requiring
filesystems to sign up for it means that we can ignore any ugly cases.

I dunno. The above _sounds_ fairly safe and easy because of how it
limits the impact. But I might be missing something.

Linus