[RFC][PATCH] cachefiles: Split and rename S_KERNEL_FILE and extend effects

From: David Howells
Date: Fri Jan 21 2022 - 08:48:36 EST


Hi Linus,

After the recent controversy over the S_KERNEL_FILE flag, I wonder if it
should be split into two different flags with different functions and have
more appropriate names given to reflect this to make its function easier to
understand. I've put in some not particulary great suggestions as to the
naming, but something better might suggest itself to others. I've also left
the flags generic as there's nothing specifically about them that means they
couldn't be used by other kernel drivers too.

Hopefully, I've also done a better job of explaining the issues (including a
couple of races) the flags fix in the attached patch.

Amir's suggestion of moving the no-remove check into may_delete()
unfortunately doesn't work because the flag forbidding the VFS op (eg. rmdir)
needs to be done with the lock held, to stop the flag from being changed until
the VFS op is complete.

David
---
Split S_KERNEL_FILE into two separate flags to do two separate jobs and give
them new names[1][2]:

(1) S_INUSE. This allows a kernel service to mark an inode as in-use by
that kernel service. This is then used by cachefiles to indicate that
it is using a file or directory. Cachefiles uses this for three
things:

(a) Defending against double-use of part of the cache due to bugs,
races or misconfiguration - which could lead to data corruption.

(b) As a fast way to tell whether a file is in use when cachefilesd
asks to cull it (culling is offloaded to userspace). Previously,
the inode details were used to look up in a big tree of records -
but doing the same job with a single bit is less expensive.

(c) To stop a file that we've agreed cachefilesd may cull from being
reused by the cache. This fixes a race between the cull request
and the object being buried because we have to drop the inode
locks that we've taken so that we can call functions like
vfs_unlink().

The race gives a window in which the object can get looked up
again - but if the file is in a cullable state, there's nowhere to
make a note that it needs to be got rid of except the backing
inode.

(2) S_NOREMOVE. If this is set, the file or directory may not be removed,
renamed or unlinked. This can then be used by cachefiles to prevent
userspace removing or renaming files and directories that the are
being used.

The directory layout in its cache is very important to cachefiles as
the names are how it indexes the contents. Removing objects can cause
cachefilesd to malfunction as it may find it can't reach bits of the
structure that it previously created and still has dentry pointers to.

This also closes a race between cachefilesd trying to cull an empty
directory and cachefiles trying to create something in it.

Amir Goldstein suggested that the check in vfs_rmdir() should be moved
to may_delete()[1], but it really needs to be done whilst the inode
lock is held.

Both of these flags can only be changed under the inode lock. I would
recommend that S_NOREMOVE only be set/cleared if S_INUSE is first set.

Note that potential usage of these flags is by no means limited to
cachefiles, which is why they've been left as general. S_INUSE, for
example, if set by one kernel service, will cause another kernel service
that uses the flag in the same way to abort that use.

Note also that S_NOREMOVE will prevent vfs_unlink() vfs_rmdir() and
vfs_rename() from operating on a file.

Also define IS_xxx() macros for the above flags[3].

Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
cc: linux-cachefs@xxxxxxxxxx
Link: https://lore.kernel.org/r/CAOQ4uxjEcvffv=rNXS-r+NLz+=6yk4abRuX_AMq9v-M4nf_PtA@xxxxxxxxxxxxxx [1]
Link: https://lore.kernel.org/r/Ydvl8Dk8z0mF0KFl@xxxxxxxxxxxxx/ [2]
Link: https://lore.kernel.org/r/88d7f8970dcc0fd0ead891b1f42f160b8d17d60e.camel@xxxxxxxxxx/ [3]
---
fs/cachefiles/namei.c | 18 ++++++++++--------
fs/namei.c | 8 +++++---
include/linux/fs.h | 5 ++++-
3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index f256c8aff7bb..30b7b71158c4 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -20,8 +20,8 @@ static bool __cachefiles_mark_inode_in_use(struct cachefiles_object *object,
struct inode *inode = d_backing_inode(dentry);
bool can_use = false;

- if (!(inode->i_flags & S_KERNEL_FILE)) {
- inode->i_flags |= S_KERNEL_FILE;
+ if (!IS_INUSE(inode)) {
+ inode->i_flags |= S_INUSE | S_NOREMOVE;
trace_cachefiles_mark_active(object, inode);
can_use = true;
} else {
@@ -53,7 +53,7 @@ static void __cachefiles_unmark_inode_in_use(struct cachefiles_object *object,
{
struct inode *inode = d_backing_inode(dentry);

- inode->i_flags &= ~S_KERNEL_FILE;
+ inode->i_flags &= ~(S_INUSE | S_NOREMOVE);
trace_cachefiles_mark_inactive(object, inode);
}

@@ -392,8 +392,10 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
};
trace_cachefiles_rename(object, d_inode(rep)->i_ino, why);
ret = cachefiles_inject_read_error();
- if (ret == 0)
+ if (ret == 0) {
+ __cachefiles_unmark_inode_in_use(object, rep);
ret = vfs_rename(&rd);
+ }
if (ret != 0)
trace_cachefiles_vfs_error(object, d_inode(dir), ret,
cachefiles_trace_rename_error);
@@ -402,7 +404,6 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
"Rename failed with error %d", ret);
}

- __cachefiles_unmark_inode_in_use(object, rep);
unlock_rename(cache->graveyard, dir);
dput(grave);
_leave(" = 0");
@@ -426,6 +427,7 @@ int cachefiles_delete_object(struct cachefiles_object *object,
dget(dentry);

inode_lock_nested(d_backing_inode(fan), I_MUTEX_PARENT);
+ cachefiles_unmark_inode_in_use(object, object->file);
ret = cachefiles_unlink(volume->cache, object, fan, dentry, why);
inode_unlock(d_backing_inode(fan));
dput(dentry);
@@ -746,7 +748,7 @@ static struct dentry *cachefiles_lookup_for_cull(struct cachefiles_cache *cache,
goto lookup_error;
if (d_is_negative(victim))
goto lookup_put;
- if (d_inode(victim)->i_flags & S_KERNEL_FILE)
+ if (IS_INUSE(d_inode(victim)))
goto lookup_busy;
return victim;

@@ -793,11 +795,11 @@ int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
/* check to see if someone is using this object */
inode = d_inode(victim);
inode_lock(inode);
- if (inode->i_flags & S_KERNEL_FILE) {
+ if (IS_INUSE(inode)) {
ret = -EBUSY;
} else {
/* Stop the cache from picking it back up */
- inode->i_flags |= S_KERNEL_FILE;
+ inode->i_flags |= S_INUSE;
ret = 0;
}
inode_unlock(inode);
diff --git a/fs/namei.c b/fs/namei.c
index d81f04f8d818..99eddc41f6aa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3959,7 +3959,7 @@ int vfs_rmdir(struct user_namespace *mnt_userns, struct inode *dir,

error = -EBUSY;
if (is_local_mountpoint(dentry) ||
- (dentry->d_inode->i_flags & S_KERNEL_FILE))
+ IS_NOREMOVE(dentry->d_inode))
goto out;

error = security_inode_rmdir(dir, dentry);
@@ -4090,7 +4090,8 @@ int vfs_unlink(struct user_namespace *mnt_userns, struct inode *dir,
inode_lock(target);
if (IS_SWAPFILE(target))
error = -EPERM;
- else if (is_local_mountpoint(dentry))
+ else if (is_local_mountpoint(dentry) ||
+ IS_NOREMOVE(dentry->d_inode))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -4603,7 +4604,8 @@ int vfs_rename(struct renamedata *rd)
goto out;

error = -EBUSY;
- if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry) ||
+ IS_NOREMOVE(old_dentry->d_inode))
goto out;

if (max_links && new_dir != old_dir) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5d3bf5b69a6..68cae4aaa6ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2173,7 +2173,8 @@ struct super_operations {
#define S_ENCRYPTED (1 << 14) /* Encrypted file (using fs/crypto/) */
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
-#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_INUSE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_NOREMOVE (1 << 18) /* File is not to be removed or renamed */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2216,6 +2217,8 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_INUSE(inode) ((inode)->i_flags & S_INUSE)
+#define IS_NOREMOVE(inode) ((inode)->i_flags & S_NOREMOVE)

#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)