Re: [PATCH] cgroup: don't queue css_release_work if one already pending

From: Tadeusz Struk
Date: Wed May 18 2022 - 12:48:32 EST


On 4/22/22 04:05, Michal Koutný wrote:
On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <tj@xxxxxxxxxx> wrote:
If this is the case, we need to hold an extra reference to be put by the
css_killed_work_fn(), right?

I looked into it a bit more lately and found that there already is such
a fuse in kill_css() [1].
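
For readers following along, the "fuse" idea can be sketched in userspace C: take an extra reference before dropping the base reference, and let the deferred killed-work step put it. This is only a toy model under loud assumptions. A plain int stands in for the percpu_ref, and every name here (toy_css, toy_kill, toy_killed_work, ...) is hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a css; NOT the kernel's struct cgroup_subsys_state. */
struct toy_css {
	int refcnt;        /* plain counter standing in for the percpu_ref */
	bool dying;        /* set once the kill has been started */
	bool offline_done; /* set by the deferred "killed" work */
	int released;      /* how many times the release step ran */
};

static void toy_init(struct toy_css *css)
{
	css->refcnt = 1;   /* the base reference */
	css->dying = false;
	css->offline_done = false;
	css->released = 0;
}

static void toy_get(struct toy_css *css)
{
	css->refcnt++;
}

static void toy_put(struct toy_css *css)
{
	assert(css->refcnt > 0);
	if (--css->refcnt == 0)
		css->released++;   /* models the release callback firing */
}

/* Models the kill path: take the fuse first, then drop the base ref. */
static void toy_kill(struct toy_css *css)
{
	if (css->dying)
		return;            /* a second kill is a no-op */
	css->dying = true;
	toy_get(css);          /* the fuse: keeps the object alive */
	toy_put(css);          /* the kill drops the base reference */
}

/* Models the deferred work that runs after the kill is confirmed. */
static void toy_killed_work(struct toy_css *css)
{
	css->offline_done = true;
	toy_put(css);          /* drop the fuse; release may now run */
}
```

In this model the release step cannot run before toy_killed_work(), because the fuse is still held when the base reference goes away. A release during the confirmation step itself would mean a reference was dropped somewhere it should not have been.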

At the same time, syzbot's stack trace demonstrates that the fuse is
ineffective:

css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146 (**)
percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline] (*)
percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
__do_softirq+0x27e/0x596 kernel/softirq.c:305

(*) this calls css_killed_ref_fn() as the confirm_switch callback
(**) zero references after confirmed kill?

So, I was also looking at the possible race with css_free_rwork_fn()
(from failed css_create()) but that would likely emit a warning from
__percpu_ref_exit().

So, I still think there's something fishy (so far possible only via
artificial ENOMEM injection) that needs an explanation...

I can't reliably reproduce this issue on either mainline or v5.10, where
syzbot originally found it. It still triggers for syzbot, though.

--
Thanks,
Tadeusz