list_add corruption in __percpu_counter_init

From: Marc Kleine-Budde
Date: Mon Mar 08 2021 - 06:49:36 EST


Hello,

Since updating one of our compile cluster machines from kernel 4.19+105+deb10u9 to
5.10.0-0.bpo.3-amd64 #1 Debian 5.10.13-1~bpo10+1, we have been hitting this bug
every 1 to 2 days:

list_add corruption. next->prev should be prev (ffffffff889a9840), but was 0000000000000000. (next=ffff9c3dcaf2a310).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:25!
invalid opcode: 0000 [#1] SMP PTI
CPU: 9 PID: 3281 Comm: python3 Not tainted 5.10.0-0.bpo.3-amd64 #1 Debian 5.10.13-1~bpo10+1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.00.0059.082320111421 08/23/2011
RIP: 0010:__list_add_valid.cold.0+0x12/0x28
Code: 80 46 32 88 48 89 ef e8 62 3e 01 00 48 c7 c0 f0 ff ff ff e9 bb 1e bf ff 48 89 d1 48 c7 c7 f8 46 32 88 48 89 c2 e8 02 2a ff ff <0f> 0b 48 89 c1 4c 89 c6 48 c7 c7 50 47 32 88 e8 ee 29 ff ff 0f 0b
RSP: 0018:ffffbb0862ba3910 EFLAGS: 00010046
RAX: 0000000000000075 RBX: ffff9c3dcaf2a268 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9c46ffb18a00 RDI: ffff9c46ffb18a00
RBP: ffff9c3dcaf2a278 R08: 0000000000000000 R09: c0000000ffffdfff
R10: 0000000000000001 R11: ffffbb0862ba3720 R12: ffff9c3dcaf2a310
R13: ffffffff889a9840 R14: 0000000000000002 R15: 0000000000002800
FS: 00007f4bfc65d740(0000) GS:ffff9c46ffb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001f5b100 CR3: 00000013298c4001 CR4: 00000000000206e0
Call Trace:
__percpu_counter_init+0x64/0xa0
blkg_rwstat_init+0x3c/0xb0
throtl_pd_alloc+0x63/0x230
blkg_alloc+0x134/0x180
blkg_create+0x232/0x320
? kmem_cache_alloc+0x30c/0x420
bio_associate_blkg_from_css+0x1cb/0x2c0
bio_associate_blkg+0x20/0x70
? ktime_get+0x3e/0xa0
linear_map+0x50/0x90 [dm_mod]
__map_bio+0x3a/0x130 [dm_mod]
__split_and_process_non_flush+0x190/0x1e0 [dm_mod]
dm_submit_bio+0x152/0x3b0 [dm_mod]
submit_bio_noacct+0xfb/0x410
? iomap_page_mkwrite_actor+0x70/0x70
submit_bio+0x43/0x190
iomap_readahead+0xb5/0x190
read_pages+0x8e/0x270
page_cache_ra_unbounded+0x1a2/0x220
generic_file_buffered_read+0x1a9/0x9b0
xfs_file_buffered_aio_read+0x44/0xb0 [xfs]
xfs_file_read_iter+0x6e/0xd0 [xfs]
new_sync_read+0x118/0x1a0
vfs_read+0xf1/0x180
ksys_read+0x59/0xd0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9


and another one:

list_add corruption. next->prev should be prev (ffffffff8bba9840), but was 0000000000000000. (next=ffff9b12c9ce1b10).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:25!
invalid opcode: 0000 [#1] SMP PTI
CPU: 1 PID: 11618 Comm: qtdeclarative-n Not tainted 5.10.0-0.bpo.3-amd64 #1 Debian 5.10.13-1~bpo10+1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.00.0059.082320111421 08/23/2011
RIP: 0010:__list_add_valid.cold.0+0x12/0x28
Code: 80 46 52 8b 48 89 ef e8 62 3e 01 00 48 c7 c0 f0 ff ff ff e9 bb 1e bf ff 48 89 d1 48 c7 c7 f8 46 52 8b 48 89 c2 e8 02 2a ff ff <0f> 0b 48 89 c1 4c 89 c6 48 c7 c7 50
RSP: 0018:ffffbab42038b580 EFLAGS: 00010046
RAX: 0000000000000075 RBX: ffff9b1629c18268 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9b1bbfa18a00 RDI: ffff9b1bbfa18a00
RBP: ffff9b1629c18278 R08: 0000000000000000 R09: c0000000ffffdfff
R10: 0000000000000001 R11: ffffbab42038b390 R12: ffff9b12c9ce1b10
R13: ffffffff8bba9840 R14: 0000000000000006 R15: 0000000000002800
FS: 00007f9f93fb2740(0000) GS:ffff9b1bbfa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9f9385dbc0 CR3: 000000082c1f8001 CR4: 00000000000206e0
Call Trace:
__percpu_counter_init+0x64/0xa0
blkg_rwstat_init+0x3c/0xb0
throtl_pd_alloc+0x63/0x230
blkg_alloc+0x134/0x180
blkg_create+0x232/0x320
? find_busiest_group+0x41/0x360
bio_associate_blkg_from_css+0x1cb/0x2c0
bio_associate_blkg+0x20/0x70
? ktime_get+0x3e/0xa0
linear_map+0x50/0x90 [dm_mod]
__map_bio+0x3a/0x130 [dm_mod]
__split_and_process_non_flush+0x190/0x1e0 [dm_mod]
dm_submit_bio+0x152/0x3b0 [dm_mod]
submit_bio_noacct+0xfb/0x410
submit_bio+0x43/0x190
? bio_add_page+0x62/0x90
_xfs_buf_ioapply+0x2af/0x410 [xfs]
[...]


And a third one, without xfs involved:

list_add corruption. next->prev should be prev (ffffffffa4ba9840), but was 0000000000000000. (next=ffff95071f379310).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:25!
invalid opcode: 0000 [#1] SMP PTI
CPU: 21 PID: 12894 Comm: vgs Not tainted 5.10.0-0.bpo.3-amd64 #1 Debian 5.10.13-1~bpo10+1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.00.0059.082320111421 08/23/2011
RIP: 0010:__list_add_valid.cold.0+0x12/0x28
Code: 80 46 52 a4 48 89 ef e8 62 3e 01 00 48 c7 c0 f0 ff ff ff e9 bb 1e bf ff 48 89 d1 48 c7 c7 f8 46 52 a4 48 89 c2 e8 02 2a ff ff <0f> 0b 48 89 c1 4c 89 c6 48 c7 c7 50 47 52 a4 e8 ee 29 ff ff 0f 0b
RSP: 0018:ffffad41c741fa80 EFLAGS: 00010046
RAX: 0000000000000075 RBX: ffff95071f379268 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff950dffc98a00 RDI: ffff950dffc98a00
RBP: ffff95071f379278 R08: 0000000000000000 R09: c0000000ffffdfff
R10: 0000000000000001 R11: ffffad41c741f890 R12: ffff95071f379310
R13: ffffffffa4ba9840 R14: 0000000000000006 R15: 0000000000002800
FS: 00007f320ee69900(0000) GS:ffff950dffc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f320f844000 CR3: 0000000f68ffe004 CR4: 00000000000206e0
Call Trace:
__percpu_counter_init+0x64/0xa0
blkg_rwstat_init+0x3c/0xb0
throtl_pd_alloc+0x63/0x230
blkg_alloc+0x134/0x180
blkg_create+0x232/0x320
bio_associate_blkg_from_css+0x1cb/0x2c0
bio_associate_blkg+0x20/0x70
blkdev_direct_IO+0x233/0x490
? aio_fsync_work+0x2c0/0x2c0
generic_file_read_iter+0x92/0x150
aio_read+0xf9/0x1a0
? io_submit_one+0x185/0xbc0
io_submit_one+0x185/0xbc0
? __fsnotify_parent+0x105/0x320
? __fsnotify_parent+0x115/0x320
? __x64_sys_io_submit+0xa6/0x1a0
__x64_sys_io_submit+0xa6/0x1a0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The system is an Intel(R) Xeon(R) CPU E5649 @ 2.53GHz with ECC RAM and
now runs stably on v4.19.
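
For reference, the check behind the "list_add corruption" message can be
sketched in user space as below. This is only an illustration modelled on
the message text in the traces above, not the kernel's code: the names
list_add_valid(), list_add_checked() and demo_zeroed_entry() are made up
for this example, and zeroing an entry that is still linked is just one
assumed way to end up with next->prev == 0. It is not a confirmed
diagnosis of what happens on this machine.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Returns true if inserting new_entry between prev and next looks sane. */
static bool list_add_valid(struct list_head *new_entry,
			   struct list_head *prev, struct list_head *next)
{
	if (next->prev != prev) {
		fprintf(stderr,
			"list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
			(void *)prev, (void *)next->prev, (void *)next);
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr,
			"list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
			(void *)next, (void *)prev->next, (void *)prev);
		return false;
	}
	if (new_entry == prev || new_entry == next) {
		fprintf(stderr, "list_add double add.\n");
		return false;
	}
	return true;
}

/* Insert new_entry right after head; refuse if the sanity check fails. */
static bool list_add_checked(struct list_head *new_entry,
			     struct list_head *head)
{
	struct list_head *prev = head, *next = head->next;

	if (!list_add_valid(new_entry, prev, next))
		return false;

	next->prev = new_entry;
	new_entry->next = next;
	new_entry->prev = prev;
	prev->next = new_entry;
	return true;
}

/*
 * Hypothetical scenario: entry "a" is zeroed while it is still on the
 * list, so the next insertion at the head sees next->prev == NULL.
 * Returns the result of that second insertion.
 */
static bool demo_zeroed_entry(void)
{
	struct list_head head, a, b;

	init_list_head(&head);
	list_add_checked(&a, &head);	/* list: head <-> a */
	memset(&a, 0, sizeof(a));	/* "a" wiped, but head.next still points at it */
	return list_add_checked(&b, &head);
}
```

In this sketch demo_zeroed_entry() returns false and prints the
"next->prev should be prev" variant of the message, with prev being the
list head and the reported value being a null pointer.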

Is this a known (or even fixed) problem? Any ideas how to debug this?
I'll ask my admins to compile the latest v5.10 and give it a try. As this is
one of our production machines, testing is a bit limited.

regards,
Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
