Re: Kernel NULL pointer deref and data corruptions with xfs on 6.1

From: Daniel Dao
Date: Mon Jul 24 2023 - 07:23:51 EST


Hi again,

We had another example of xarray corruption involving xfs and zsmalloc. We are
running zram as swap. We have two tasks deadlocked, waiting for a page to be
released.

The following backtrace is from the zsmalloc task:
#0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2)
#1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8)
#2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3)
#3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2)
#4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4)
#5 folio_wait_locked (/cfsetup_build/build/linux/include/linux/pagemap.h:1022:3)
#6 wait_on_page_locked (/cfsetup_build/build/linux/include/linux/pagemap.h:1034:2)
#7 lock_zspage (/cfsetup_build/build/linux/mm/zsmalloc.c:1736:3)
#8 async_free_zspage (/cfsetup_build/build/linux/mm/zsmalloc.c:1974:3)
#9 process_one_work (/cfsetup_build/build/linux/kernel/workqueue.c:2289:2)
#10 worker_thread (/cfsetup_build/build/linux/kernel/workqueue.c:2436:4)
#11 kthread (/cfsetup_build/build/linux/kernel/kthread.c:376:9)
#12 ret_from_fork+0x22/0x2d (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:306)

The following backtrace is from a userspace task:
#0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2)
#1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8)
#2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3)
#3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2)
#4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4)
#5 folio_put_wait_locked (/cfsetup_build/build/linux/mm/filemap.c:1465:9)
#6 filemap_update_page (/cfsetup_build/build/linux/mm/filemap.c:2472:4)
#7 filemap_get_pages (/cfsetup_build/build/linux/mm/filemap.c:2606:9)
#8 filemap_read (/cfsetup_build/build/linux/mm/filemap.c:2676:11)
#9 xfs_file_buffered_read (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:277:8)
#10 xfs_file_read_iter (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:302:9)
#11 call_read_iter (/cfsetup_build/build/linux/include/linux/fs.h:2199:9)
#12 new_sync_read (/cfsetup_build/build/linux/fs/read_write.c:389:8)
#13 vfs_read (/cfsetup_build/build/linux/fs/read_write.c:470:9)
#14 ksys_read (/cfsetup_build/build/linux/fs/read_write.c:613:9)
#15 do_syscall_x64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:50:14)
#16 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:80:7)
#17 entry_SYSCALL_64+0x83/0x164 (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:120)

The folio in question has .mapping = (struct address_space *)zsmalloc_mops+0x2 =
0xffffffffc1a9f332 and flags 'PG_locked|PG_waiters|PG_private|PG_slob_free'. In
fact, the file's i_pages mapping has a node full of these pages. The following
entries come from the mapping in frame #6 at 0xffffffffa4e1c586
(filemap_get_pages+0x5d6/0x624), in filemap_update_page at
/cfsetup_build/build/linux/mm/filemap.c:2472:4 (inlined):

>>> for index, entry in xa_for_each(trace[6]['mapping'].i_pages.address_of_()):
...     print(index, entry, cast('struct folio *', entry).page.mapping.address_of_())

2936 (void *)0xffffe53ab6454f00 *(struct address_space **)0xffffe53ab6454f18 = 0xffff9ffc9ded16b0
2940 (void *)0xffffe53ab6454300 *(struct address_space **)0xffffe53ab6454318 = 0xffff9ffc9ded16b0
2944 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332 <== index
2945 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
2946 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
...
2976 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332 <== last_index
...
3006 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
3007 (void *)0xffffe53ad71c37c0 *(struct address_space **)0xffffe53ad71c37d8 = 0xffff9ffc9ded16b0
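
As a rough sketch of how the values in this dump can be read (this is not part
of the drgn session): the low two bits of page->mapping are tag bits, so
"zsmalloc_mops+0x2" would be a movable_operations pointer tagged as movable
rather than a real address_space. The constants below are assumptions taken
from include/linux/page-flags.h and the default 64-slot xarray node size; the
helper names classify_mapping and leaf_slot are made up for illustration.

```python
# Assumptions (not from the report): tag-bit values as in
# include/linux/page-flags.h, and the default xarray chunk shift of 6.
PAGE_MAPPING_ANON = 0x1
PAGE_MAPPING_MOVABLE = 0x2
PAGE_MAPPING_FLAGS = PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE
XA_CHUNK_SHIFT = 6

KINDS = {0x0: "file", 0x1: "anon", 0x2: "movable", 0x3: "ksm"}


def classify_mapping(mapping):
    """Split a raw page->mapping value into (kind, pointer without tag bits)."""
    tag = mapping & PAGE_MAPPING_FLAGS
    return KINDS[tag], mapping & ~PAGE_MAPPING_FLAGS


def leaf_slot(index):
    """Return (leaf node number, slot within that node) for a page index."""
    return index >> XA_CHUNK_SHIFT, index & ((1 << XA_CHUNK_SHIFT) - 1)
```

With these assumptions, classify_mapping(0xffffffffc1a9f332) gives
("movable", 0xffffffffc1a9f330), while 0xffff9ffc9ded16b0 classifies as "file".
leaf_slot() puts index 2944 at slot 0 and index 3007 at slot 63 of the same
leaf node (number 46), and index 2940 in the previous node.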

On Fri, Jul 21, 2023 at 11:49 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> In the past, we reported some corruptions on xfs/iomap/xarray combinations on
> kernel 6.1. This happened very rarely (once a week for every 10000 hosts), and
> the host exhibited symptoms such as rcu_preempt self-detected stalls, NULL
> pointer dereferences, or deadlocks when reading a particular file.
>
> We do not have a reproducer yet, but we now have more debugging data which
> hopefully should help narrow this down. Details follow:
>
> 1. Kernel NULL pointer dereferences in __filemap_get_folio
>
> This happened on a few different hosts, with a few different repeated addresses.
> The addresses are 0000000000000036, 0000000000000076, and 00000000000000f6. This
> looks like the xarray is corrupted and we were trying to do some work on a
> sibling entry.
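
(Inline note on the arithmetic behind the sibling-entry guess.) A minimal
sketch, assuming the 0x34 displacement from the trapping instruction
"mov 0x34(%rax),%eax" in the oopses below, and the internal-entry encoding
from include/linux/xarray.h (an internal entry is (value << 2) | 2, and it
counts as a sibling when value is below XA_CHUNK_SIZE - 1). The helper name
decode_fault is hypothetical.

```python
# Assumptions: REFCOUNT_OFFSET comes from "mov 0x34(%rax),%eax" in the oops;
# the entry encoding and XA_CHUNK_SIZE follow include/linux/xarray.h with the
# default chunk shift of 6.
REFCOUNT_OFFSET = 0x34
XA_CHUNK_SIZE = 64


def decode_fault(addr):
    """Return (entry, slot, looks_like_sibling) for a faulting address."""
    entry = addr - REFCOUNT_OFFSET
    is_internal = (entry & 3) == 2      # same test as xa_is_internal()
    slot = entry >> 2                   # same shift as xa_to_internal()
    looks_like_sibling = is_internal and slot < XA_CHUNK_SIZE - 1
    return entry, slot, looks_like_sibling
```

Under these assumptions the three addresses decode to entries 0x2, 0x42 and
0xc2, i.e. sibling entries pointing at slots 0, 16 and 48. The first two match
the RAX values in the two oopses below.
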
>
> BUG: kernel NULL pointer dereference, address: 0000000000000036
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 18806c5067 P4D 18806c5067 PUD 188ed48067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 73 PID: 3579408 Comm: prometheus Tainted: G O
> 6.1.34-cloudflare-2023.6.7 #1
> Hardware name: GIGABYTE R162-Z12-CD1/MZ12-HD4-CD, BIOS M03 11/19/2021
> RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29
> include/linux/atomic/atomic-arch-fallback.h:1242
> include/linux/atomic/atomic-arch-fallback.h:1267
> include/linux/atomic/atomic-instrumented.h:608
> include/linux/page_ref.h:238 include/linux/page_ref.h:247
> include/linux/page_ref.h:280 include/linux/page_ref.h:313
> mm/filemap.c:1863 mm/filemap.c:1915)
> Code: 10 e8 99 ac 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02
> 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b>
> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
> All code
> ========
> 0: 10 e8 adc %ch,%al
> 2: 99 cltd
> 3: ac lods %ds:(%rsi),%al
> 4: 84 00 test %al,(%rax)
> 6: 48 3d 06 04 00 00 cmp $0x406,%rax
> c: 49 89 c4 mov %rax,%r12
> f: 74 e2 je 0xfffffffffffffff3
> 11: 48 3d 02 04 00 00 cmp $0x402,%rax
> 17: 74 da je 0xfffffffffffffff3
> 19: 48 85 c0 test %rax,%rax
> 1c: 0f 84 2e 02 00 00 je 0x250
> 22: a8 01 test $0x1,%al
> 24: 0f 85 e3 00 00 00 jne 0x10d
> 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction
> 2d: 85 c0 test %eax,%eax
> 2f: 74 c2 je 0xfffffffffffffff3
> 31: 8d 50 01 lea 0x1(%rax),%edx
> 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
>
> Code starting with the faulting instruction
> ===========================================
> 0: 8b 40 34 mov 0x34(%rax),%eax
> 3: 85 c0 test %eax,%eax
> 5: 74 c2 je 0xffffffffffffffc9
> 7: 8d 50 01 lea 0x1(%rax),%edx
> a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
> RSP: 0000:ffffaf5587cdfc60 EFLAGS: 00010246
> RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000002
> RDX: 0000000000000008 RSI: ffffa45181fa8000 RDI: ffffaf5587cdfc70
> RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000006e44f
> R10: 000000000006e450 R11: 000000000006e448 R12: 0000000000000002
> R13: ffffa3fff6fdfeb0 R14: 000000000006e44a R15: 00000000000000d1
> FS: 000000c9e385ac90(0000) GS:ffffa4153fc40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000036 CR3: 000000296a1bc002 CR4: 0000000000770ee0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die_body.cold (arch/x86/kernel/dumpstack.c:478
> arch/x86/kernel/dumpstack.c:465 arch/x86/kernel/dumpstack.c:420)
> ? page_fault_oops (arch/x86/mm/fault.c:727)
> ? migrate_task_rq_fair (include/linux/sched.h:1921
> kernel/sched/fair.c:3932 kernel/sched/fair.c:7497)
> ? do_user_addr_fault (include/linux/kprobes.h:404
> include/linux/kprobes.h:597 arch/x86/mm/fault.c:1280)
> ? ttwu_queue_wakelist (kernel/sched/core.c:3880)
> ? exc_page_fault (arch/x86/include/asm/irqflags.h:40
> arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527
> arch/x86/mm/fault.c:1575)
> ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
> ? __filemap_get_folio (arch/x86/include/asm/atomic.h:29
> include/linux/atomic/atomic-arch-fallback.h:1242
> include/linux/atomic/atomic-arch-fallback.h:1267
> include/linux/atomic/atomic-instrumented.h:608
> include/linux/page_ref.h:238 include/linux/page_ref.h:247
> include/linux/page_ref.h:280 include/linux/page_ref.h:313
> mm/filemap.c:1863 mm/filemap.c:1915)
> filemap_fault (mm/filemap.c:3120)
> ? preempt_count_add (include/linux/ftrace.h:950
> kernel/sched/core.c:5685 kernel/sched/core.c:5682
> kernel/sched/core.c:5710)
> __do_fault (mm/memory.c:4234)
> do_fault (mm/memory.c:4564 mm/memory.c:4692)
> __handle_mm_fault (mm/memory.c:4964 mm/memory.c:5106)
> handle_mm_fault (mm/memory.c:5227)
> do_user_addr_fault (include/linux/sched/signal.h:433
> arch/x86/mm/fault.c:1430)
> exc_page_fault (arch/x86/include/asm/irqflags.h:40
> arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527
> arch/x86/mm/fault.c:1575)
> asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
> RIP: 0033:0x268b8b9
> Code: 70 48 89 4c 24 78 48 8b 94 24 b8 00 00 00 0f 1f 00 48 85 d2
> 74 3f 48 89 ce 48 29 d9 4c 8d 49 04 49 f7 d9 49 c1 f9 3f 49 21 f9 <46>
> 8b 0c 08 44 89 4c 24 34 90 90 48 89 d3 48 89 c1 41 b8 01 00 00
> All code
> ========
> 0: 70 48 jo 0x4a
> 2: 89 4c 24 78 mov %ecx,0x78(%rsp)
> 6: 48 8b 94 24 b8 00 00 mov 0xb8(%rsp),%rdx
> d: 00
> e: 0f 1f 00 nopl (%rax)
> 11: 48 85 d2 test %rdx,%rdx
> 14: 74 3f je 0x55
> 16: 48 89 ce mov %rcx,%rsi
> 19: 48 29 d9 sub %rbx,%rcx
> 1c: 4c 8d 49 04 lea 0x4(%rcx),%r9
> 20: 49 f7 d9 neg %r9
> 23: 49 c1 f9 3f sar $0x3f,%r9
> 27: 49 21 f9 and %rdi,%r9
> 2a:* 46 8b 0c 08 mov (%rax,%r9,1),%r9d <-- trapping
> instruction
> 2e: 44 89 4c 24 34 mov %r9d,0x34(%rsp)
> 33: 90 nop
> 34: 90 nop
> 35: 48 89 d3 mov %rdx,%rbx
> 38: 48 89 c1 mov %rax,%rcx
> 3b: 41 rex.B
> 3c: b8 .byte 0xb8
> 3d: 01 00 add %eax,(%rax)
> ...
>
> Code starting with the faulting instruction
> ===========================================
> 0: 46 8b 0c 08 mov (%rax,%r9,1),%r9d
> 4: 44 89 4c 24 34 mov %r9d,0x34(%rsp)
> 9: 90 nop
> a: 90 nop
> b: 48 89 d3 mov %rdx,%rbx
> e: 48 89 c1 mov %rax,%rcx
> 11: 41 rex.B
> 12: b8 .byte 0xb8
> 13: 01 00 add %eax,(%rax)
> ...
> RSP: 002b:000000cbc509f520 EFLAGS: 00010202
> RAX: 00007e81cf427e0c RBX: 00000000000222cc RCX: 00000000123817b2
> RDX: 000000c00001ac00 RSI: 00000000123a3a7e RDI: 00000000000222c8
> RBP: 000000cbc509f5b0 R08: 0000000003cb5910 R09: 00000000000222c8
> R10: 000000c4de3dea00 R11: 0000000000000123 R12: 0000000000000000
> R13: 0000000000000005 R14: 000000c83bad2340 R15: 0000010000000000
> </TASK>
> Modules linked in: xt_connlabel xt_MASQUERADE nf_conntrack_netlink
> xfrm_user xfrm_algo xt_addrtype br_netfilter bridge overlay zstd
> zstd_compress zram zsmalloc tun tcp_diag inet_diag raid0 md_mod essiv
> dm_crypt trusted asn1_encoder tee ip6table_filter ip6table_mangle
> ip6table_raw ip6table_security ip6table_nat ip6_tables xt_bpf
> xt_conntrack xt_multiport xt_set iptable_filter xt_NFLOG nfnetlink_log
> xt_connbytes xt_comment xt_connmark xt_statistic iptable_mangle xt_nat
> xt_tcpudp iptable_nat nf_nat xt_CT iptable_raw ip_set_hash_ip
> ip_set_hash_net ip_set nfnetlink sch_fq nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 8021q garp mrp stp llc bonding nvme_fabrics amd64_edac
> kvm_amd ipmi_ssif kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3
> acpi_ipmi mlx5_core aesni_intel ipmi_si mlxfw rapl xhci_pci nvme tls
> ipmi_devintf tiny_power_button psample nvme_core xhci_hcd i2c_piix4
> ccp ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables
> bcmcrypt(O)
> crypto_simd cryptd
> CR2: 0000000000000036
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29
> include/linux/atomic/atomic-arch-fallback.h:1242
> include/linux/atomic/atomic-arch-fallback.h:1267
> include/linux/atomic/atomic-instrumented.h:608
> include/linux/page_ref.h:238 include/linux/page_ref.h:247
> include/linux/page_ref.h:280 include/linux/page_ref.h:313
> mm/filemap.c:1863 mm/filemap.c:1915)
> Code: 10 e8 99 ac 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02
> 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b>
> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
> All code
> ========
> 0: 10 e8 adc %ch,%al
> 2: 99 cltd
> 3: ac lods %ds:(%rsi),%al
> 4: 84 00 test %al,(%rax)
> 6: 48 3d 06 04 00 00 cmp $0x406,%rax
> c: 49 89 c4 mov %rax,%r12
> f: 74 e2 je 0xfffffffffffffff3
> 11: 48 3d 02 04 00 00 cmp $0x402,%rax
> 17: 74 da je 0xfffffffffffffff3
> 19: 48 85 c0 test %rax,%rax
> 1c: 0f 84 2e 02 00 00 je 0x250
> 22: a8 01 test $0x1,%al
> 24: 0f 85 e3 00 00 00 jne 0x10d
> 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction
> 2d: 85 c0 test %eax,%eax
> 2f: 74 c2 je 0xfffffffffffffff3
> 31: 8d 50 01 lea 0x1(%rax),%edx
> 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
>
> Code starting with the faulting instruction
> ===========================================
> 0: 8b 40 34 mov 0x34(%rax),%eax
> 3: 85 c0 test %eax,%eax
> 5: 74 c2 je 0xffffffffffffffc9
> 7: 8d 50 01 lea 0x1(%rax),%edx
> a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
> RSP: 0000:ffffaf5587cdfc60 EFLAGS: 00010246
> RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000002
> RDX: 0000000000000008 RSI: ffffa45181fa8000 RDI: ffffaf5587cdfc70
> RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000006e44f
> R10: 000000000006e450 R11: 000000000006e448 R12: 0000000000000002
> R13: ffffa3fff6fdfeb0 R14: 000000000006e44a R15: 00000000000000d1
> FS: 000000c9e385ac90(0000) GS:ffffa4153fc40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000036 CR3: 000000296a1bc002 CR4: 0000000000770ee0
> PKRU: 55555554
>
> BUG: kernel NULL pointer dereference, address: 0000000000000076
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 7acd78067 P4D 7acd78067 PUD 7acd79067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 93 PID: 3784417 Comm: prometheus Tainted: G O
> 6.1.20-cloudflare-2023.3.18 #1
> Hardware name: GIGABYTE R162-Z13-CD/MZ12-HD2-CD, BIOS R13 07/17/2020
> RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29
> include/linux/atomic/atomic-arch-fallback.h:1242
> include/linux/atomic/atomic-arch-fallback.h:1267
> include/linux/atomic/atomic-instrumented.h:608
> include/linux/page_ref.h:238 include/linux/page_ref.h:247
> include/linux/page_ref.h:280 include/linux/page_ref.h:313
> mm/filemap.c:1863 mm/filemap.c:1915)
> Code: 10 e8 b9 a4 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02
> 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b>
> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
> All code
> ========
> 0: 10 e8 adc %ch,%al
> 2: b9 a4 84 00 48 mov $0x480084a4,%ecx
> 7: 3d 06 04 00 00 cmp $0x406,%eax
> c: 49 89 c4 mov %rax,%r12
> f: 74 e2 je 0xfffffffffffffff3
> 11: 48 3d 02 04 00 00 cmp $0x402,%rax
> 17: 74 da je 0xfffffffffffffff3
> 19: 48 85 c0 test %rax,%rax
> 1c: 0f 84 2e 02 00 00 je 0x250
> 22: a8 01 test $0x1,%al
> 24: 0f 85 e3 00 00 00 jne 0x10d
> 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction
> 2d: 85 c0 test %eax,%eax
> 2f: 74 c2 je 0xfffffffffffffff3
> 31: 8d 50 01 lea 0x1(%rax),%edx
> 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
>
> Code starting with the faulting instruction
> ===========================================
> 0: 8b 40 34 mov 0x34(%rax),%eax
> 3: 85 c0 test %eax,%eax
> 5: 74 c2 je 0xffffffffffffffc9
> 7: 8d 50 01 lea 0x1(%rax),%edx
> a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
> RSP: 0000:ffffb15106683c60 EFLAGS: 00010246
> RAX: 0000000000000042 RBX: 0000000000000000 RCX: 0000000000000002
> RDX: 0000000000000018 RSI: ffff934b0029efc8 RDI: ffffb15106683c70
> RBP: 0000000000000000 R08: 0000000000000402 R09: 00000000000cbe5f
> R10: 00000000000cbe60 R11: 00000000000cbe5c R12: 0000000000000042
> R13: ffff93449c251eb0 R14: 00000000000cbe59 R15: 00000000000000d1
> FS: 000000c000300090(0000) GS:ffff937e6ed40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000076 CR3: 0000000a6528e000 CR4: 0000000000350ee0
> Call Trace:
> <TASK>
> filemap_fault (mm/filemap.c:3120)
> ? preempt_count_add (include/linux/ftrace.h:950
> kernel/sched/core.c:5685 kernel/sched/core.c:5682
> kernel/sched/core.c:5710)
> __do_fault (mm/memory.c:4234)
> do_fault (mm/memory.c:4564 mm/memory.c:4692)
> __handle_mm_fault (mm/memory.c:4964 mm/memory.c:5106)
> handle_mm_fault (mm/memory.c:5227)
> do_user_addr_fault (include/linux/sched/signal.h:433
> arch/x86/mm/fault.c:1430)
> exc_page_fault (arch/x86/include/asm/irqflags.h:40
> arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527
> arch/x86/mm/fault.c:1575)
> asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
> RIP: 0033:0x268b8b9
> Code: 70 48 89 4c 24 78 48 8b 94 24 b8 00 00 00 0f 1f 00 48 85 d2
> 74 3f 48 89 ce 48 29 d9 4c 8d 49 04 49 f7 d9 49 c1 f9 3f 49 21 f9 <46>
> 8b 0c 08 44 89 4c 24 34 90 90 48 89 d3 48 89 c1 41 b8 01 00 00
> All code
> ========
> 0: 70 48 jo 0x4a
> 2: 89 4c 24 78 mov %ecx,0x78(%rsp)
> 6: 48 8b 94 24 b8 00 00 mov 0xb8(%rsp),%rdx
> d: 00
> e: 0f 1f 00 nopl (%rax)
> 11: 48 85 d2 test %rdx,%rdx
> 14: 74 3f je 0x55
> 16: 48 89 ce mov %rcx,%rsi
> 19: 48 29 d9 sub %rbx,%rcx
> 1c: 4c 8d 49 04 lea 0x4(%rcx),%r9
> 20: 49 f7 d9 neg %r9
> 23: 49 c1 f9 3f sar $0x3f,%r9
> 27: 49 21 f9 and %rdi,%r9
> 2a:* 46 8b 0c 08 mov (%rax,%r9,1),%r9d <-- trapping
> instruction
> 2e: 44 89 4c 24 34 mov %r9d,0x34(%rsp)
> 33: 90 nop
> 34: 90 nop
> 35: 48 89 d3 mov %rdx,%rbx
> 38: 48 89 c1 mov %rax,%rcx
> 3b: 41 rex.B
> 3c: b8 .byte 0xb8
> 3d: 01 00 add %eax,(%rax)
> ...
>
> Code starting with the faulting instruction
> ===========================================
> 0: 46 8b 0c 08 mov (%rax,%r9,1),%r9d
> 4: 44 89 4c 24 34 mov %r9d,0x34(%rsp)
> 9: 90 nop
> a: 90 nop
> b: 48 89 d3 mov %rdx,%rbx
> e: 48 89 c1 mov %rax,%rcx
> 11: 41 rex.B
> 12: b8 .byte 0xb8
> 13: 01 00 add %eax,(%rax)
> ...
> RSP: 002b:000000d735bb3558 EFLAGS: 00010206
> RAX: 00007c018402dad8 RBX: 000000000002c3d8 RCX: 0000000037f9be1c
> RDX: 000000c000222c00 RSI: 0000000037fc81f4 RDI: 000000000002c3d4
> RBP: 000000d735bb35e8 R08: 0000000003cb5910 R09: 000000000002c3d4
> R10: 000000c385d2a000 R11: 0000000000000021 R12: 0000000000000000
> R13: 000000000000000b R14: 000000d1bb70e340 R15: 0000000001000000
> </TASK>
> Modules linked in: veth xt_MASQUERADE nf_conntrack_netlink
> xfrm_user xfrm_algo xt_addrtype br_netfilter bridge overlay raid1
> md_mod essiv dm_crypt trusted tee asn1_encoder xt_hl ip6table_filter
> ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables
> xt_tcpudp xt_conntrack xt_comment xt_multiport xt_set iptable_filter
> iptable_mangle iptable_nat nf_nat xt_CT iptable_raw ip_set_hash_ip
> ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 8021q mrp garp stp llc bonding
> amd64_edac kvm_amd ipmi_ssif kvm irqbypass crc32_pclmul crc32c_intel
> mlx5_core sha512_ssse3 psample acpi_ipmi aesni_intel xhci_pci nvme
> ipmi_si rapl tls ipmi_devintf tiny_power_button nvme_core mlxfw
> xhci_hcd i2c_piix4 ccp ipmi_msghandler button fuse dm_mod dax efivarfs
> ip_tables x_tables bcmcrypt(O) crypto_simd cryptd
> CR2: 0000000000000076
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29
> include/linux/atomic/atomic-arch-fallback.h:1242
> include/linux/atomic/atomic-arch-fallback.h:1267
> include/linux/atomic/atomic-instrumented.h:608
> include/linux/page_ref.h:238 include/linux/page_ref.h:247
> include/linux/page_ref.h:280 include/linux/page_ref.h:313
> mm/filemap.c:1863 mm/filemap.c:1915)
> Code: 10 e8 b9 a4 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02
> 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b>
> 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34
> All code
> ========
> 0: 10 e8 adc %ch,%al
> 2: b9 a4 84 00 48 mov $0x480084a4,%ecx
> 7: 3d 06 04 00 00 cmp $0x406,%eax
> c: 49 89 c4 mov %rax,%r12
> f: 74 e2 je 0xfffffffffffffff3
> 11: 48 3d 02 04 00 00 cmp $0x402,%rax
> 17: 74 da je 0xfffffffffffffff3
> 19: 48 85 c0 test %rax,%rax
> 1c: 0f 84 2e 02 00 00 je 0x250
> 22: a8 01 test $0x1,%al
> 24: 0f 85 e3 00 00 00 jne 0x10d
> 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction
> 2d: 85 c0 test %eax,%eax
> 2f: 74 c2 je 0xfffffffffffffff3
> 31: 8d 50 01 lea 0x1(%rax),%edx
> 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
>
> Code starting with the faulting instruction
> ===========================================
> 0: 8b 40 34 mov 0x34(%rax),%eax
> 3: 85 c0 test %eax,%eax
> 5: 74 c2 je 0xffffffffffffffc9
> 7: 8d 50 01 lea 0x1(%rax),%edx
> a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15
> f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12)
> RSP: 0000:ffffb15106683c60 EFLAGS: 00010246
> RAX: 0000000000000042 RBX: 0000000000000000 RCX: 0000000000000002
> RDX: 0000000000000018 RSI: ffff934b0029efc8 RDI: ffffb15106683c70
> RBP: 0000000000000000 R08: 0000000000000402 R09: 00000000000cbe5f
> R10: 00000000000cbe60 R11: 00000000000cbe5c R12: 0000000000000042
> R13: ffff93449c251eb0 R14: 00000000000cbe59 R15: 00000000000000d1
> FS: 000000c000300090(0000) GS:ffff937e6ed40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000076 CR3: 0000000a6528e000 CR4: 0000000000350ee0
> note: prometheus[3784417] exited with irqs disabled
>
> 2. Kernel NULL pointer dereferences in xfs_read_iomap_begin
>
> BUG: unable to handle page fault for address: 0000000000034668
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 11cfd37067 P4D 11cfd37067 PUD b88086067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 124 PID: 3831226 Comm: rocksdb:low Kdump: loaded Tainted: G
> W O L 6.1.27-cloudflare-2023.5.0 #1
> Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> RIP: 0010:xfs_read_iomap_begin (fs/xfs/xfs_iomap.c:1200)
> Code: 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 50 48
> 89 14 24 4c 89 44 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 48 <48>
> 8b 87 >
> All code
> ========
> 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 5: 41 57 push %r15
> 7: 41 56 push %r14
> 9: 41 55 push %r13
> b: 41 54 push %r12
> d: 55 push %rbp
> e: 53 push %rbx
> f: 48 83 ec 50 sub $0x50,%rsp
> 13: 48 89 14 24 mov %rdx,(%rsp)
> 17: 4c 89 44 24 08 mov %r8,0x8(%rsp)
> 1c: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax
> 23: 00 00
> 25: 48 89 44 24 48 mov %rax,0x48(%rsp)
> 2a:* 48 rex.W <-- trapping instruction
> 2b: 8b .byte 0x8b
> 2c: 87 00 xchg %eax,(%rax)
>
> Code starting with the faulting instruction
> ===========================================
> 0: 48 rex.W
> 1: 8b .byte 0x8b
> 2: 87 00 xchg %eax,(%rax)
> RSP: 0018:ffffa63810733a70 EFLAGS: 00010282
> RAX: 78ac714f0997e100 RBX: ffffa63810733b40 RCX: 0000000000000000
> RDX: 0000000000004000 RSI: 0000000000000000 RDI: 00000000000347a0
> RBP: ffffffff8664d950 R08: ffffa63810733b68 R09: ffffa63810733bb0
> R10: 000000000001f627 R11: 0000000000000000 R12: ffffa63810733b68
> R13: ffffa63810733bb0 R14: 00000000000019c1 R15: 00000000fffffff5
> FS: 00007f48d8504700(0000) GS:ffffa2fe5ef00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000034668 CR3: 00000013037ec001 CR4: 0000000000770ee0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __mod_memcg_lruvec_state (mm/memcontrol.c:613 mm/memcontrol.c:799)
> iomap_iter (fs/iomap/iter.c:76)
> iomap_read_folio (fs/iomap/buffered-io.c:342)
> ? xfs_end_bio (fs/xfs/xfs_aops.c:542)
> filemap_read_folio (mm/filemap.c:2407)
> filemap_get_pages (mm/filemap.c:2492 mm/filemap.c:2606)
> filemap_read (mm/filemap.c:2677)
> xfs_file_buffered_read (fs/xfs/xfs_file.c:278)
> xfs_file_read_iter (fs/xfs/xfs_file.c:304)
> vfs_read (fs/read_write.c:390 fs/read_write.c:470)
> __x64_sys_pread64 (include/linux/file.h:44 fs/read_write.c:666
> fs/read_write.c:675 fs/read_write.c:672 fs/read_write.c:672)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> RIP: 0033:0x7f49061ca917
> Code: 08 89 3c 24 48 89 4c 24 18 e8 05 f4 ff ff 4c 8b 54 24 18 48
> 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48>
> 3d 00 >
> All code
> ========
> 0: 08 89 3c 24 48 89 or %cl,-0x76b7dbc4(%rcx)
> 6: 4c 24 18 rex.WR and $0x18,%al
> 9: e8 05 f4 ff ff call 0xfffffffffffff413
> e: 4c 8b 54 24 18 mov 0x18(%rsp),%r10
> 13: 48 8b 54 24 10 mov 0x10(%rsp),%rdx
> 18: 41 89 c0 mov %eax,%r8d
> 1b: 48 8b 74 24 08 mov 0x8(%rsp),%rsi
> 20: 8b 3c 24 mov (%rsp),%edi
> 23: b8 11 00 00 00 mov $0x11,%eax
> 28: 0f 05 syscall
> 2a:* 48 rex.W <-- trapping instruction
> 2b: 3d .byte 0x3d
> ...
>
> Code starting with the faulting instruction
> ===========================================
> 0: 48 rex.W
> 1: 3d .byte 0x3d
> ...
> RSP: 002b:00007f48d84ffc70 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> RAX: ffffffffffffffda RBX: 00000000018a0c90 RCX: 00007f49061ca917
> RDX: 00000000000c294f RSI: 000000002265e000 RDI: 000000000000003c
> RBP: 00007f48d84ffda0 R08: 0000000000000000 R09: 00007f48d84ffe60
> R10: 000000000191dcd8 R11: 0000000000000293 R12: 0000000007c3c6c0
> R13: 00000000000c294f R14: 00000000000c294f R15: 000000000191dcd8
> </TASK>
> Modules linked in: xt_connlabel overlay nft_compat esp4
> xt_hashlimit ip_set_hash_netport xt_length nf_conntrack_netlink
> mpls_gso mpls_iptunnel >
> tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 8021q
> garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass
> crc32_pclmul crc32>
> CR2: 0000000000034668
> ---[ end trace 0000000000000000 ]---
>
> We also have a deadlock reading a very specific file on this host. We managed to
> take a kdump on this host and extracted the state of the mapping.
>
>
> >>> trace
> #0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2)
> #1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8)
> #2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3)
> #3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2)
> #4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4)
> #5 folio_put_wait_locked (/cfsetup_build/build/linux/mm/filemap.c:1465:9)
> #6 filemap_update_page (/cfsetup_build/build/linux/mm/filemap.c:2472:4)
> #7 filemap_get_pages (/cfsetup_build/build/linux/mm/filemap.c:2606:9)
> #8 filemap_read (/cfsetup_build/build/linux/mm/filemap.c:2676:11)
> #9 xfs_file_buffered_read
> (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:277:8)
> #10 xfs_file_read_iter (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:302:9)
> #11 call_read_iter (/cfsetup_build/build/linux/include/linux/fs.h:2199:9)
> #12 new_sync_read (/cfsetup_build/build/linux/fs/read_write.c:389:8)
> #13 vfs_read (/cfsetup_build/build/linux/fs/read_write.c:470:9)
> #14 ksys_read (/cfsetup_build/build/linux/fs/read_write.c:613:9)
> #15 do_syscall_x64
> (/cfsetup_build/build/linux/arch/x86/entry/common.c:50:14)
> #16 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:80:7)
> #17 entry_SYSCALL_64+0x83/0x164
> (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:120)
> #18 0x7f05f0b093ce
> >>> folio = trace[6]['folio']
> >>> decode_page_flags(folio)
> 'PG_locked|PG_waiters|PG_head'
> >>> folio
> *(struct folio *)0xffffd67406346000 = {
> .flags = (unsigned long)13510764522438785,
> .lru = (struct list_head){
> .next = (struct list_head *)0xdead000000000100,
> .prev = (struct list_head *)0xdead000000000122,
> },
> .__filler = (void *)0xdead000000000100,
> .mlock_count = (unsigned int)290,
> .mapping = (struct address_space *)0x0,
> .index = (unsigned long)18446641474676726016,
> .private = (void *)0x400000,
> ._mapcount = (atomic_t){
> .counter = (int)-1,
> },
> ._refcount = (atomic_t){
> .counter = (int)1,
> },
> .memcg_data = (unsigned long)0,
> .page = (struct page){
> .flags = (unsigned long)13510764522438785,
> .lru = (struct list_head){
> .next = (struct list_head *)0xdead000000000100,
> .prev = (struct list_head *)0xdead000000000122,
> },
> .__filler = (void *)0xdead000000000100,
> .mlock_count = (unsigned int)290,
> .buddy_list = (struct list_head){
> .next = (struct list_head *)0xdead000000000100,
> .prev = (struct list_head *)0xdead000000000122,
> },
> .pcp_list = (struct list_head){
> .next = (struct list_head *)0xdead000000000100,
> .prev = (struct list_head *)0xdead000000000122,
> },
> .mapping = (struct address_space *)0x0,
> .index = (unsigned long)18446641474676726016,
> .private = (unsigned long)4194304,
> .pp_magic = (unsigned long)16045481047390945536,
> .pp = (struct page_pool *)0xdead000000000122,
> ._pp_mapping_pad = (unsigned long)0,
> .dma_addr = (unsigned long)18446641474676726016,
> .dma_addr_upper = (unsigned long)4194304,
> .pp_frag_count = (atomic_long_t){
> .counter = (s64)4194304,
> },
> .compound_head = (unsigned long)16045481047390945536,
> .compound_dtor = (unsigned char)34,
> .compound_order = (unsigned char)1,
> .compound_mapcount = (atomic_t){
> .counter = (int)-559087616,
> },
> .compound_pincount = (atomic_t){
> .counter = (int)0,
> },
> .compound_nr = (unsigned int)0,
> ._compound_pad_1 = (unsigned long)16045481047390945536,
> ._compound_pad_2 = (unsigned long)16045481047390945570,
> .deferred_list = (struct list_head){
> .next = (struct list_head *)0x0,
> .prev = (struct list_head *)0xffffa2afcd181900,
> },
> ._pt_pad_1 = (unsigned long)16045481047390945536,
> .pmd_huge_pte = (pgtable_t)0xdead000000000122,
> ._pt_pad_2 = (unsigned long)0,
> .pt_mm = (struct mm_struct *)0xffffa2afcd181900,
> .pt_frag_refcount = (atomic_t){
> .counter = (int)-854058752,
> },
> .ptl = (spinlock_t){
> .rlock = (struct raw_spinlock){
> .raw_lock = (arch_spinlock_t){
> .val = (atomic_t){
> .counter = (int)4194304,
> },
> .locked = (u8)0,
> .pending = (u8)0,
> .locked_pending = (u16)0,
> .tail = (u16)64,
> },
> },
> },
> .pgmap = (struct dev_pagemap *)0xdead000000000100,
> .zone_device_data = (void *)0xdead000000000122,
> .callback_head = (struct callback_head){
> .next = (struct callback_head *)0xdead000000000100,
> .func = (void (*)(struct callback_head
> *))0xdead000000000122,
> },
> ._mapcount = (atomic_t){
> .counter = (int)-1,
> },
> .page_type = (unsigned int)4294967295,
> ._refcount = (atomic_t){
> .counter = (int)1,
> },
> .memcg_data = (unsigned long)0,
> },
> ._flags_1 = (unsigned long)13510764522373120,
> .__head = (unsigned long)18446698392541487105,
> ._folio_dtor = (unsigned char)1,
> ._folio_order = (unsigned char)2,
> ._total_mapcount = (atomic_t){
> .counter = (int)-1,
> },
> ._pincount = (atomic_t){
> .counter = (int)0,
> },
> ._folio_nr_pages = (unsigned int)4,
> }
> >>> for index, entry in xa_for_each(trace[7]['mapping'].i_pages.address_of_()):
> ...     print(index, entry, cast('struct folio *', entry).page.mapping.address_of_())
> ....
> 6464 (void *)0xffffd674c130a000 *(struct address_space **)0xffffd674c130a018 = 0xffffa2b30e93b2b0
> 6528 (void *)0xffffd674beb22000 *(struct address_space **)0xffffd674beb22018 = 0xffffa2b30e93b2b0
> 6592 (void *)0xffffd67406346000 *(struct address_space **)0xffffd67406346018 = 0x0 <===== our folio
> 6624 (void *)0x7037e8d8000100d (struct address_space **)0x7037e8d80001025
> 6625 (void *)0x7037e047000100d (struct address_space **)0x7037e0470001025
> ....
>
> This looks like the xarray is corrupted, and for some reason we have a locked
> folio in the mapping whose page has no mapping.
>
> Any suggestions on narrowing this down to a hypothesis we can try to reproduce,
> or on potential fixes, are very much appreciated. We are also trying some
> different kernel configurations on different sets of hosts to see if the
> problems go away for them, such as:
> - 6.1.36 without "xfs: Support large folios"
>   (6795801366da0cd3d99e27c37f020a8f16714886)
> - 6.1.36 without THP
> - 6.1.37 with the following series backported: "xfs, iomap: fix data
>   corruption due to stale cached iomaps"
>   https://lore.kernel.org/linux-fsdevel/20221129001632.GX3600936@xxxxxxxxxxxxxxxxxxx/
>
> Best,
> Daniel.