Re: [syzbot] WARNING in mntput_no_expire (3)

From: Al Viro
Date: Wed May 18 2022 - 00:57:59 EST


On Wed, May 18, 2022 at 04:38:53AM +0000, Al Viro wrote:
> On Wed, May 18, 2022 at 01:58:40AM +0000, Al Viro wrote:
> > On Wed, May 18, 2022 at 01:10:20AM +0000, Al Viro wrote:
> > > On Wed, May 18, 2022 at 12:59:46AM +0000, Al Viro wrote:
> > > > On Tue, May 17, 2022 at 10:58:15PM +0000, Al Viro wrote:
> > > > > On Tue, May 17, 2022 at 03:49:07PM -0700, syzbot wrote:
> > > > > > Hello,
> > > > > >
> > > > > > syzbot has tested the proposed patch but the reproducer is still triggering an issue:
> > > > > > WARNING in mntput_no_expire
> > > > >
> > > > > Obvious question: which filesystem it is?
> > > >
> > > > FWIW, can't reproduce here - at least not with C reproducer +
> > > > -rc7^ kernel + .config from report + debian kvm image (bullseye,
> > > > with systemd shite replaced with sysvinit, which might be relevant).
> > > >
> > > > In case systemd-specific braindamage is needed to reproduce it...
> > > > Hell knows; at least mount --make-rshared / doesn't seem to suffice.
> > >
> > > ... doesn't reproduce with genuine systemd either. FWIW, 4-way SMP
> > > setup here.
> >
> > OK, reproduced...
>
> FWIW, it smells like something (cgroup?) fucking up percpu allocation/freeing.
> Note that struct mount has both refcount and writers count held in percpu;
> replacing the refcount with atomic_t gets rid of seeing negative refcount
> in mntput_no_expire(), but leaves negative writers count caught in
> cleanup_mnt(); turn that from WARN_ON into printk and we get past that,
> only to see
> percpu ref (css_release) <= 0 (-4294967294)
> immediately afterwards.
>
> IOW, it looks like we are getting not messed refcounting on either side,
> but same refcount physically shared by unrelated objects.

Gotcha.
percpu_ref_init():
ref->percpu_count_ptr = (unsigned long)
__alloc_percpu_gfp(sizeof(unsigned long), align, gfp);
if (!ref->percpu_count_ptr)
return -ENOMEM;
data = kzalloc(sizeof(*ref->data), gfp);
if (!data) {
free_percpu((void __percpu *)ref->percpu_count_ptr);
return -ENOMEM;
}

cgroup_create():
err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
if (err)
goto err_free_css;

err = cgroup_idr_alloc(&ss->css_idr, NULL, 2, 0, GFP_KERNEL);
if (err < 0)
goto err_free_css;

Now note that we end up hitting the same path in case of successful and
failed percpu_ref_init(). With no way to tell if css->refcnt.percpu_count_ptr
is an already freed object or needs to be freed. And sure enough, we have

err_free_css:
list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

with css_free_rwork_fn() starting with
percpu_ref_exit(&css->refcnt);

which will give that double free. That might be not the only cause of
trouble, but this looks like a bug and a plausible source of the
symptoms observed here. Let's see if this helps:

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index af9302141bcf..e5c5315da274 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -76,6 +76,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
data = kzalloc(sizeof(*ref->data), gfp);
if (!data) {
free_percpu((void __percpu *)ref->percpu_count_ptr);
+ ref->percpu_count_ptr = 0;
return -ENOMEM;
}