Re: kernel BUG in page_add_anon_rmap

From: David Hildenbrand
Date: Mon Jan 30 2023 - 04:04:33 EST



I reproduced on next-20230127 (did not try upstream yet).

Upstream's fine; on next-20230127 (with David's repro) it bisects to
5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
hand this on to Liam, rather than delay you by puzzling over it further myself.


Thanks for identifying the problematic commit! ...


I think the two key things are that a) THP is set to "always" and b) we have a
NUMA setup [I assume].
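
For reproducing, a minimal sketch of flipping that THP mode (assuming the
usual sysfs path; needs root):

/* Sketch only: switch THP to "always" via the standard sysfs knob. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        fputs("always", f);
        fclose(f);
        return 0;
}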

The relevant bits:

[ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
[ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
[ 439.899611] memcg:ffff986dc4689000
[ 439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
[ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
[ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
[ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((rmap_t)((((1UL))) << (0)))))


Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
mapped into a single page table (no fork() or similar).

Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
index:0x20003?

I was assuming the whole folio would get migrated. As you raise below, it's all a bit unclear once THP get involved and we're dealing with mbind() and page migration.
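
To spell out the condition that fires, here is a userspace sketch only
(assuming the usual convention that a subpage's _mapcount is stored biased
by -1, and that the ((rmap_t)1 << 0) bit in the dump corresponds to
RMAP_EXCLUSIVE): once the subpage already has a mapping, the increment no
longer reports "first", and an exclusive rmap flag then trips the assertion.

/* Userspace sketch, not the kernel code. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long rmap_t;
#define RMAP_EXCLUSIVE ((rmap_t)1 << 0) /* assumed to be the bit from the dump */

int main(void)
{
        int mapcount = 0;               /* biased by -1, so 0 == already mapped once */
        rmap_t flags = RMAP_EXCLUSIVE;
        bool first = (++mapcount == 0); /* only true when going from -1 to 0 */

        if (!first && (flags & RMAP_EXCLUSIVE))
                printf("would trip VM_BUG_ON_PAGE: mapcount now %d, not first, but exclusive\n",
                       mapcount + 1);
        return 0;
}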


I created this reduced reproducer, which triggers 100% of the time:

Very helpful, thank you.



#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
        /* 16 MiB anonymous mapping at a fixed address, THP-eligible */
        mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
             MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
        madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);

        /* populate the first THP, then mlock two small sub-ranges of it */
        *(uint32_t*)0x20000080 = 0x80000;
        mlock((void*)0x20001000ul, 0x2000ul);
        mlock((void*)0x20000000ul, 0x3000ul);

It's not an mlock() issue in particular: quickly established by
substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
Looks like a vma splitting issue now.

Gah, I should have tried something like that first before suspecting it's mlock-related. :)
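
For concreteness, my (untested) reading of that substitution in the
reproducer above would be roughly:

        /* instead of the two mlock() calls above: */
        madvise((void*)0x20001000ul, 0x2000ul, MADV_NOHUGEPAGE);
        madvise((void*)0x20000000ul, 0x3000ul, MADV_NOHUGEPAGE);

The reproducer then continues with the mbind() below.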


        /* mbind a single 4 KiB page in the middle of that THP, asking for migration */
        mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
              MPOL_MF_MOVE);

I guess it will turn out not to be relevant to this particular syzbug,
but what do we expect an mbind() of just 0x1000 of a THP to do?

It's a subject I've wrestled with unsuccessfully in the past: I found
myself arriving at one conclusion (split THP) in one place, and a contrary
conclusion (widen range) in another place, and never had time to work out
one unified answer.

I'm aware of a similar issue with long-term page pinning: we might want to pin only a 4k portion of a THP, but we end up blocking the whole THP from getting migrated/swapped/split/freed ... until we unpin (if ever). I wrote a reproducer [1] a while ago to show how you can effectively steal most THP in the system with a comparatively small memlock limit, using io_uring ...

In theory, we could split the THP before long-term pinning only a subregion ... but what if we cannot split the THP because it's already pinned (by a previous pinning request that covered the whole THP)? Copying instead of splitting would also not be possible if the page is already pinned ... so we'd never want to allow long-term pinning of a THP ... but that means we would have to fail pinning if splitting the THP fails, and that there would be performance consequences for THP users :/

Non-trivial ... just like mlocking only a part of a THP or mbinding different parts of a THP to different nodes ...

[1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c
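
For illustration only (this is not the [1] reproducer, just a rough sketch of
the idea with liburing; build with -luring): registering a single 4 KiB fixed
buffer that lands inside a THP-backed region long-term pins the whole folio
underneath.

#include <liburing.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;
        struct iovec iov;
        void *area;

        if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        /* 2 MiB anonymous region, hopefully backed by a single THP */
        area = mmap(NULL, 2ul * 1024 * 1024, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (area == MAP_FAILED)
                return 1;
        madvise(area, 2ul * 1024 * 1024, MADV_HUGEPAGE);
        *(char *)area = 1;      /* fault it in */

        /* register only 4 KiB as a fixed buffer; the pin covers the whole folio */
        iov.iov_base = area;
        iov.iov_len = 4096;
        if (io_uring_register_buffers(&ring, &iov, 1) < 0)
                return 1;

        pause();                /* hold the long-term pin */
        return 0;
}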

--
Thanks,

David / dhildenb