Re: [PATCH RFC] proc: Fix a dentry lock race between release_task and lookup

From: Eric W. Biederman
Date: Tue May 31 2022 - 12:57:52 EST


Zhihao Cheng <chengzhihao1@xxxxxxxxxx> writes:

> Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
> moved proc_flush_task() behind __exit_signal(). Then, process systemd
> can take long period high cpu usage during releasing task in following
> concurrent processes:
>
> systemd ps
> kernel_waitid stat(/proc/pid)
> do_wait filename_lookup
> wait_consider_task lookup_fast
> release_task
> __exit_signal
> __unhash_process
> detach_pid
> __change_pid // remove task->pid_links
> d_revalidate -> pid_revalidate // 0
> d_invalidate(/proc/pid)
> shrink_dcache_parent(/proc/pid)
> d_walk(/proc/pid)
> spin_lock_nested(/proc/pid/fd)
> // iterating opened fd
> proc_flush_pid |
> d_invalidate (/proc/pid/fd) |
> shrink_dcache_parent(/proc/pid/fd) |
> shrink_dentry_list(subdirs) ↓
> shrink_lock_dentry(/proc/pid/fd) ---> race on dentry lock
>
> Function d_invalidate() will remove dentry from hash firstly, but why does
> proc_flush_pid() process dentry '/proc/pid/fd' before dentry '/proc/pid'?
> That's because proc_pid_make_inode() adds proc inode in reverse order by
> invoking hlist_add_head_rcu(). But proc should not add any inodes under
> '/proc/pid' except '/proc/pid/task/pid', fix it by adding inode into
> 'pid->inodes' only if the inode is /proc/pid or /proc/pid/task/pid.

If I understand correctly you are saying that under some circumstances
this code runs slow, and you are proposing an optimization.

That optimization is to change the content of the pid->inodes list
from all directories under that pid, to just the /proc/<tgid> and
/proc/<tgid>/task/<pid>.

The justification being that d_invalidate on the parent directory will
invalidate all children. So only those two directories are interesting
from a d_invalidate point of view.

That seems like a valid optimization.

This could also count as a regression fix if you can show how the
performance changed poorly when the pid->inodes change was introduced
and how the performance improves with your change. I currently only
see that you hit a pathological case and you are correcting it.

As for the actual code change I think it would be better to
remove the code from proc_pid_make_inode and make a helper
proc_pid_make_base_inode that performs the extra work of
adding to the pid->list. Not adding a flag makes the code
easier to follow.

Something like the code below.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d654ce7150fd..9d025e70ddc3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1915,11 +1915,6 @@ struct inode *proc_pid_make_inode(struct super_block * sb,

/* Let the pid remember us for quick removal */
ei->pid = pid;
- if (S_ISDIR(mode)) {
- spin_lock(&pid->lock);
- hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
- spin_unlock(&pid->lock);
- }

task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
@@ -1932,6 +1927,27 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
return NULL;
}

+struct inode *proc_pid_make_base_inode(struct super_block * sb,
+ struct task_struct *task, umode_t mode)
+{
+ struct inode * inode;
+ struct proc_inode *ei;
+ struct pid *pid;
+
+ inode = proc_pid_make_inode(sb, task, mode);
+ if (!inode)
+ return NULL;
+
+ /* Let proc_flush_pid find this directory inode */
+ ei = PROC_I(inode);
+ pid = ei->pid;
+ spin_lock(&pid->lock);
+ hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+ spin_unlock(&pid->lock);
+
+ return inode;
+}
+
int pid_getattr(struct user_namespace *mnt_userns, const struct path *path,
struct kstat *stat, u32 request_mask, unsigned int query_flags)
{
@@ -3351,7 +3367,7 @@ static struct dentry *proc_pid_instantiate(struct dentry * dentry,
{
struct inode *inode;

- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
+ inode = proc_pid_make_base_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
if (!inode)
return ERR_PTR(-ENOENT);

@@ -3650,7 +3666,7 @@ static struct dentry *proc_task_instantiate(struct dentry *dentry,
struct task_struct *task, const void *ptr)
{
struct inode *inode;
- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
+ inode = proc_pid_make_base_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
if (!inode)
return ERR_PTR(-ENOENT);


Eric