Re: [External] Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

From: Muchun Song
Date: Mon Apr 12 2021 - 06:54:15 EST


On Mon, Apr 12, 2021 at 6:42 PM Christian Borntraeger
<borntraeger@xxxxxxxxxx> wrote:
>
> FWIW, I was away last week, and I checked yesterday's next (e99d8a849517) regression runs.
> I still do see errors in our CI system:
>
> [ 2263.021681] ------------[ cut here ]------------
> [ 2263.021697] percpu ref (obj_cgroup_release) <= 0 (0) after switching to atomic
> [ 2263.021748] WARNING: CPU: 4 PID: 0 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x1ea/0x1f8
> [ 2263.021756] Modules linked in: scsi_debug vfio_pci irqbypass vfio_virqfd kvm vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink dm_service_time zfcp scsi_transport_fc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua rpcrdma sunrpc rdma_ucm rdma_cm iw_cm ib_cm mlx5_ib dm_mod ib_uverbs ib_core s390_trng vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio eadm_sch zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 mlx5_core sha512_s390 sha256_s390 sha1_s390 sha_common nvme nvme_core pkey zcrypt rng_core autofs4 [last unloaded: vfio_ap]
> [ 2263.021820] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.0-20210412.rc6.git0.e99d8a849517.300.fc33.s390x+next #1
> [ 2263.021823] Hardware name: IBM 8561 T01 703 (LPAR)
> [ 2263.021825] Krnl PSW : 0704c00180000000 000000025b234c1e (percpu_ref_switch_to_atomic_rcu+0x1ee/0x1f8)
> [ 2263.021829] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 2263.021832] Krnl GPRS: c0000000fffeffff 00000002f7212818 0000000000000042 00000000fffeffff
> [ 2263.021834] 00000000ffffffea 0000038000000001 0000000000000000 000003800000017c
> [ 2263.021836] 000000025b980988 00000000b774d0e0 000003fee191d5d8 8000000000000000
> [ 2263.021838] 000000008034c000 00000002f7227570 000000025b234c1a 00000380000aba28
> [ 2263.021849] Krnl Code: 000000025b234c0e: e3309fe8ff04 lg %r3,-24(%r9)
> 000000025b234c14: c0e5001ebe92 brasl %r14,000000025b60c938
> #000000025b234c1a: af000000 mc 0,0
> >000000025b234c1e: a7f4ffcc brc 15,000000025b234bb6
> 000000025b234c22: 0707 bcr 0,%r7
> 000000025b234c24: 0707 bcr 0,%r7
> 000000025b234c26: 0707 bcr 0,%r7
> 000000025b234c28: eb6ff0480024 stmg %r6,%r15,72(%r15)
> [ 2263.021912] Call Trace:
> [ 2263.021914] [<000000025b234c1e>] percpu_ref_switch_to_atomic_rcu+0x1ee/0x1f8
> [ 2263.021917] ([<000000025b234c1a>] percpu_ref_switch_to_atomic_rcu+0x1ea/0x1f8)
> [ 2263.021919] [<000000025abe16fe>] rcu_do_batch+0x146/0x608
> [ 2263.021924] [<000000025abe5ff4>] rcu_core+0x124/0x1d0
> [ 2263.021926] [<000000025b62a222>] __do_softirq+0x13a/0x3c8
> [ 2263.021930] [<000000025ab5d3f6>] irq_exit+0xce/0xf8
> [ 2263.021934] [<000000025b61a5f6>] do_ext_irq+0xd6/0x160
> [ 2263.021937] [<000000025b627c3c>] ext_int_handler+0xc4/0xf4
> [ 2263.021939] [<0000000000000000>] 0x0
> [ 2263.021943] [<000000025b62775a>] default_idle_call+0x42/0x110
> [ 2263.021945] [<000000025ab99328>] do_idle+0xd8/0x168
> [ 2263.021949] [<000000025ab99576>] cpu_startup_entry+0x36/0x40
> [ 2263.021952] [<000000025ab1f33a>] smp_start_secondary+0x82/0x88
> [ 2263.021955] Last Breaking-Event-Address:
> [ 2263.021955] [<000000025abc8828>] vprintk_emit+0xa8/0x110
> [ 2263.021961] Kernel panic - not syncing: panic_on_warn set ...
> [ 2263.021962] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.0-20210412.rc6.git0.e99d8a849517.300.fc33.s390x+next #1
> [ 2263.021964] Hardware name: IBM 8561 T01 703 (LPAR)
> [ 2263.021965] Call Trace:
> [ 2263.021966] [<000000025b60bc9a>] show_stack+0x92/0xd8
> [ 2263.021972] [<000000025b6161c0>] dump_stack+0x90/0xc0
> [ 2263.021975] [<000000025b60cab2>] panic+0x112/0x308
> [ 2263.021977] [<000000025ab5571a>] __warn+0xc2/0x158
> [ 2263.021981] [<000000025b2a5e4a>] report_bug+0xb2/0x130
> [ 2263.021984] [<000000025ab09ef4>] monitor_event_exception+0x44/0xc0
> [ 2263.021986] [<000000025b61a1e8>] __do_pgm_check+0xe0/0x1f0
> [ 2263.021988] [<000000025b627b30>] pgm_check_handler+0x118/0x160
> [ 2263.021990] [<000000025b234c1e>] percpu_ref_switch_to_atomic_rcu+0x1ee/0x1f8
> [ 2263.021992] ([<000000025b234c1a>] percpu_ref_switch_to_atomic_rcu+0x1ea/0x1f8)
> [ 2263.021993] [<000000025abe16fe>] rcu_do_batch+0x146/0x608
> [ 2263.021995] [<000000025abe5ff4>] rcu_core+0x124/0x1d0
> [ 2263.021997] [<000000025b62a222>] __do_softirq+0x13a/0x3c8
> [ 2263.021998] [<000000025ab5d3f6>] irq_exit+0xce/0xf8
> [ 2263.022000] [<000000025b61a5f6>] do_ext_irq+0xd6/0x160
> [ 2263.022001] [<000000025b627c3c>] ext_int_handler+0xc4/0xf4
> [ 2263.022003] [<0000000000000000>] 0x0
> [ 2263.022004] [<000000025b62775a>] default_idle_call+0x42/0x110
> [ 2263.022006] [<000000025ab99328>] do_idle+0xd8/0x168
> [ 2263.022008] [<000000025ab99576>] cpu_startup_entry+0x36/0x40
>
> So either the fix was not complete or it is still missing in next.

The fix is now in the mm tree. I guess the branch you
tested does not contain it yet. You can check whether
obj_cgroup_get_many() exists in your tree. If it does
not, my guess is correct.
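
A rough sketch of that check, in case it is useful for a CI job. The
helper name tree_mentions_symbol() and the directories it scans
(include/linux and mm) are assumptions made for illustration; they are
not part of the patch. It only reports whether the identifier appears
in the sources, which hints at, but does not prove, that the fix is
applied.

```python
import os
import sys


def tree_mentions_symbol(root, symbol, subdirs=("include/linux", "mm")):
    """Return True if `symbol` occurs in any .c/.h file under root/subdirs.

    The default subdirs are an assumption about where the helper lives.
    """
    for sub in subdirs:
        # os.walk yields nothing for a missing directory, so a partial
        # checkout simply results in "not found".
        for dirpath, _dirnames, filenames in os.walk(os.path.join(root, sub)):
            for name in filenames:
                if not name.endswith((".c", ".h")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as src:
                    if symbol in src.read():
                        return True
    return False


if __name__ == "__main__":
    # Usage: python check_symbol.py /path/to/linux
    tree = sys.argv[1] if len(sys.argv) > 1 else "."
    found = tree_mentions_symbol(tree, "obj_cgroup_get_many")
    print("obj_cgroup_get_many found" if found else "obj_cgroup_get_many not found")
```

Run it with the path of the kernel tree that was built for the failing
kernel. If it prints "not found", that tree most likely predates the fix.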

Thanks.