[PATCH] exit: Retain nsproxy for exit_task_work() work entries

From: Michal Koutný
Date: Wed Dec 08 2021 - 13:06:17 EST


The reported issue is an attempted write in a cgroup file, by a zombie
who has/had acct(2) enabled. Such a write needs cgroup_ns for access
checking. Ordinary acct_process() would not be affected by this as it is
called well before exit_task_namespaces(). However, the reported NULL
dereference is a different acct data writer:

Call Trace:
<TASK>
kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
__kernel_write+0x5d1/0xaf0 fs/read_write.c:535
do_acct_process+0x112a/0x17b0 kernel/acct.c:518
acct_pin_kill+0x27/0x130 kernel/acct.c:173
pin_kill+0x2a6/0x940 fs/fs_pin.c:44
mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81
cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130
task_work_run+0x146/0x1c0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0x705/0x24f0 kernel/exit.c:832
do_group_exit+0x168/0x2d0 kernel/exit.c:929
get_signal+0x16b0/0x2090 kernel/signal.c:2820
arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae

i.e. called as one of task_work_run() entries.

The historical commit 8aac62706ada ("move exit_task_namespaces() outside
of exit_notify()") argues that exit_task_namespaces() must come before
exit_task_work() because ipc_ns cleanup calls fput/task_work_add.

There is accompanying commit e7b2c4069252 ("fput: task_work_add() can
fail if the caller has passed exit_task_work()") in the original series
that makes fput() robust in situations when task_work_add() cannot be
used anymore.

So in order to ensure that task_work_run() entries of the exiting task
have the nsproxy still available, swap the order of
exit_task_namespaces() and exit_task_work().

This change may appear like a partial revert of 8aac62706ada but this
particular ordering change shouldn't matter with the fix from
e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify
simpler) still holds.

Reported-by: syzbot+50f5cf33a284ce738b62@xxxxxxxxxxxxxxxxxxxxxxxxx
Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@xxxxxxxxxx
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Michal Koutný <mkoutny@xxxxxxxx>
---
kernel/exit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

I wasn't able to reproduce the syzbot's crash manually so the effectiveness of
the fix is only based on the reasoning.

diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..0c2abdebb87c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -828,8 +828,8 @@ void __noreturn do_exit(long code)
exit_fs(tsk);
if (group_dead)
disassociate_ctty(1);
- exit_task_namespaces(tsk);
exit_task_work(tsk);
+ exit_task_namespaces(tsk);
exit_thread(tsk);

/*
--
2.33.1