security: lockless stat() issues

From: Linus Torvalds
Date: Fri Oct 04 2013 - 15:33:23 EST


Guys,
we've been working on scaling path lookup again, and 3.12 will be
pretty kick-ass on many loads.

HOWEVER, there's one really annoying case: we do all the path lookup
in RCU mode, where we can avoid touching any dentry state etc at all,
but then to "finalize" the path lookup in order to use it long-term,
we have to increment reference counts etc. That can be somewhat
expensive for high-scalability cases, since it will force-dirty (and
thus make exclusive in the cache) the core dentry cache entry.

And that's all fine for a lot of the common cases: when you do things
like open a file, you really do have to increment the reference count,
and there's no question that we do the right thing. And most people
who really open files tend to work on them, so getting the dentry
exclusively is fine.

There's one very important exception, though: things like "stat()" and
"access()" do *not* open a file in order to hold on to it. And they
are quite common (stat() in particular), _and_ they are often done on
files that are shared and then passed by rather than worked on..

Now, interestingly, both stat() and access() actually _already_ do
some kind of retry for special circumstances (LOOKUP_REVAL), and that
really looks like it could be extended to just do the whole lookup in
RCU mode too, and thus avoid ever finalizing the pathname.

However, the LSM interface doesn't really allow for that.

So how do people feel about passing a "mode" value for
security_inode_getattr(), the same way we do for
security_inode_permission()? The only flag would be that MAY_NOT_BLOCK
flag that gets set for RCU lookup, and the semantics would be the same
(return -ECHILD if you need to sleep).

Attached is a patch that adds the interface, and then makes all
security layers just do that ECHILD thing (and nobody actually sets
MAY_NOT_BLOCK yet). So it's purely preparatory. It's also
insufficient, because we'll need the same kind o fflag for the
low-level filesystem "i_op->getattr()" call, but that's an independent
issue.

Al, any comments?

Linus

Attachment: patch.diff
Description: Binary data