Re: kernel BUG in page_add_anon_rmap

From: Hugh Dickins
Date: Fri Jan 27 2023 - 12:02:21 EST


On Fri, 27 Jan 2023, David Hildenbrand wrote:
> On 26.01.23 19:57, Matthew Wilcox wrote:
> > On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> >> Good day, dear maintainers,
> >>
> >> We found a bug using a modified kernel configuration file used by syzbot.
> >>
> >> We enhanced the coverage of the configuration file using our tool,
> >> klocalizer.
> >>
> >> Kernel Branch: 6.2.0-rc5-next-20230124
> >> Kernel
> >> config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> >> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
> >>
> >> Thank you!
> >>
> >> Best regards,
> >> Sanan Hasanov
> >>
> >> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> >> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL)))
> >> << (0)))))
> >> ------------[ cut here ]------------
> >
> > I know it says "cut here" and you did that, but including just a few
> > lines above that would be so much more helpful. I can infer that this
> > is a multi-page folio, but more than that is hard to tell.
> >
> >> kernel BUG at mm/rmap.c:1248!
> >
> > That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> >
> >> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> >> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124
> >> #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> >> 04/01/2014
> >> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> >> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8
> >> 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73
> >> c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> >> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> >> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> >> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> >> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> >> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> >> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> >> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000)
> >> knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> >> Call Trace:
> >> <TASK>
> >> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
> >
> > if (folio_test_anon(folio))
> > page_add_anon_rmap(new, vma, pvmw.address,
> > rmap_flags);
> >
> > Earlier in that function, we had:
> > if (folio_test_anon(folio) &&
> > !is_readable_migration_entry(entry))
> > rmap_flags |= RMAP_EXCLUSIVE;
> >
> > so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
> > set, so we're trying to do just one page from the folio.
> >
> > All right, back to rmap.c:
> >
> > first = atomic_inc_and_test(&page->_mapcount);
> >
> > So first is clearly false (ie _mapcount was not -1), implying somebody
> > else already mapped this page. Not really sure what's going on at
> > this point. Seems unlikely that the folio changes in
> > remove_migration_pte() are responsible since they're from last January.
> > Huang has some more changes to migrate.c that I don't feel qualified
> > to judge.
> >
> > Nothing's jumping out at me as obviously wrong. Is it possible to
> > do a bisect?
>
> I reproduced on next-20230127 (did not try upstream yet).
>
> I think two key things are that a) THP are set to "always" and b) we have a
> NUMA setup [I assume].
>
> The relevant bits:
>
> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
> mapping:0000000000000000 index:0x20003 pfn:0x14ee03
> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
> nr_pages_mapped:511 pincount:0
> [ 439.899611] memcg:ffff986dc4689000
> [ 439.902207] anon flags:
> 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
> dead000000000400
> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
> 0000000000000000
> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
> ffff98714338a001
> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
> ffff986dc4689000
> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
> rmap_t)((((1UL))) << (0)))))
>
>
> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
> mapped into a single
> page table (no fork() or similar).
>
> I created this reduced reproducer that triggers 100%:
>
>
> #include <stdint.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <numaif.h>
>
> int main(void)
> {
> mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
> MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
> madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>
> *(uint32_t*)0x20000080 = 0x80000;
> mlock((void*)0x20001000ul, 0x2000ul);
> mlock((void*)0x20000000ul, 0x3000ul);
> mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
> MPOL_MF_MOVE);
> return 0;
> }
>
> We map a large-enough are for a THP and then populate a fresh anon THP (PMD
> mapped)
> to write to it.
>
> The first mlock() will trigger PTE-mapping the THP and mlocking that subpage.
> The second mlock() seems to cause the issue. The final mbind() triggers page
> migration.
>
> Removing one of the mlock() makes it work. Note that we do a double
> mlock() of the same page -- the one we are then trying to migrate.
>
> Somehow, the double mlock() of the same page seems to affect our mapcount.
>
> CCing Hugh.

Thanks David - most especially for the reproducer, not tried here yet.
I'll assume this is my bug, and get into it later in the day.

Hugh