Re: [PATCH v5 1/1] mm: refactor initialization of struct page for holes in memory layout

From: Mike Rapoport
Date: Tue Feb 16 2021 - 06:14:34 EST


On Mon, Feb 15, 2021 at 09:45:30AM +0100, David Hildenbrand wrote:
> On 14.02.21 18:29, Mike Rapoport wrote:
> > On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
> > > On 12.02.21 10:55, David Hildenbrand wrote:
> > > > On 08.02.21 12:08, Mike Rapoport wrote:
> > > > > +#ifdef CONFIG_SPARSEMEM
> > > > > + /*
> > > > > + * Sections in the memory map may not match actual populated
> > > > > + * memory, extend the node span to cover the entire section.
> > > > > + */
> > > > > + *start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> > > > > + *end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> > > >
> > > > Does that mean that we might create overlapping zones when one node
> > >
> > > s/overlapping zones/overlapping nodes/
> > >
> > > > starts in the middle of a section and the other one ends in the middle
> > > > of a section?
> > >
> > > > Could it be a problem? (e.g., would we have to look at neighboring nodes
> > > > when making the decision to extend, and how far to extend?)
> >
> > Having a node end/start in a middle of a section would be a problem, but in
> > this case I don't see a way to detect how a node should be extended :(
>
> Running QEMU with something like:
>
> ...
> -m 8G \
> -smp sockets=2,cores=2 \
> -object memory-backend-ram,id=bmem0,size=4160M \
> -object memory-backend-ram,id=bmem1,size=4032M \

This is an interesting setup :)

TBH, I've tried to think what physical configuration would be problematic
for the implicit node extension, and I had concerns about arm64 with its
huge section size, but it entirely slipped my mind that a VM can have
a really weird memory configuration.

> -numa node,nodeid=0,cpus=0-1,memdev=bmem0 -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
> ...
>
> Creates such a setup.
>
> With an older kernel:
>
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
> [...]
> [ 0.002506] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.002508] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> [ 0.002509] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
> [ 0.002510] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
> [ 0.002511] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
> [ 0.002513] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
> [ 0.002519] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
> [ 0.002669] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
> [ 0.017947] memblock: reserved range [0x0000000000000000-0x0000000000001000] is not in memory
> [ 0.017953] memblock: reserved range [0x000000000009f000-0x0000000000100000] is not in memory
> [ 0.017956] Zone ranges:
> [ 0.017957] DMA [mem 0x0000000000000000-0x0000000000ffffff]
> [ 0.017958] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.017960] Normal [mem 0x0000000100000000-0x000000023fffffff]
> [ 0.017961] Device empty
> [ 0.017962] Movable zone start for each node
> [ 0.017964] Early memory node ranges
> [ 0.017965] node 0: [mem 0x0000000000000000-0x00000000bffdffff]
> [ 0.017966] node 0: [mem 0x0000000100000000-0x0000000143ffffff]
> [ 0.017967] node 1: [mem 0x0000000144000000-0x000000023fffffff]
> [ 0.017969] Initmem setup node 0 [mem 0x0000000000000000-0x0000000143ffffff]
> [ 0.017971] On node 0 totalpages: 1064928
> [ 0.017972] DMA zone: 64 pages used for memmap
> [ 0.017973] DMA zone: 21 pages reserved
> [ 0.017974] DMA zone: 4096 pages, LIFO batch:0
> [ 0.017994] DMA32 zone: 12224 pages used for memmap
> [ 0.017995] DMA32 zone: 782304 pages, LIFO batch:63
> [ 0.022281] DMA32: Zeroed struct page in unavailable ranges: 32
> [ 0.022286] Normal zone: 4352 pages used for memmap
> [ 0.022287] Normal zone: 278528 pages, LIFO batch:63
> [ 0.023769] Initmem setup node 1 [mem 0x0000000144000000-0x000000023fffffff]
> [ 0.023774] On node 1 totalpages: 1032192
> [ 0.023775] Normal zone: 16128 pages used for memmap
> [ 0.023775] Normal zone: 1032192 pages, LIFO batch:63
>
>
> With current next/master:
>
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
> [...]
> [ 0.002419] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.002421] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> [ 0.002422] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
> [ 0.002423] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
> [ 0.002424] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
> [ 0.002426] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
> [ 0.002432] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
> [ 0.002583] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
> [ 0.017722] Zone ranges:
> [ 0.017726] DMA [mem 0x0000000000000000-0x0000000000ffffff]
> [ 0.017728] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.017729] Normal [mem 0x0000000100000000-0x000000023fffffff]
> [ 0.017731] Device empty
> [ 0.017732] Movable zone start for each node
> [ 0.017734] Early memory node ranges
> [ 0.017735] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.017736] node 0: [mem 0x0000000000100000-0x00000000bffdffff]
> [ 0.017737] node 0: [mem 0x0000000100000000-0x0000000143ffffff]
> [ 0.017738] node 1: [mem 0x0000000144000000-0x000000023fffffff]
> [ 0.017741] Initmem setup node 0 [mem 0x0000000000000000-0x0000000147ffffff]
> [ 0.017742] On node 0 totalpages: 1064830
> [ 0.017743] DMA zone: 64 pages used for memmap
> [ 0.017744] DMA zone: 21 pages reserved
> [ 0.017745] DMA zone: 3998 pages, LIFO batch:0
> [ 0.017765] DMA zone: 98 pages in unavailable ranges
> [ 0.017766] DMA32 zone: 12224 pages used for memmap
> [ 0.017766] DMA32 zone: 782304 pages, LIFO batch:63
> [ 0.022042] DMA32 zone: 32 pages in unavailable ranges
> [ 0.022046] Normal zone: 4608 pages used for memmap
> [ 0.022047] Normal zone: 278528 pages, LIFO batch:63
> [ 0.023601] Normal zone: 16384 pages in unavailable ranges
> [ 0.023606] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
> [ 0.023608] On node 1 totalpages: 1032192
> [ 0.023609] Normal zone: 16384 pages used for memmap
> [ 0.023609] Normal zone: 1032192 pages, LIFO batch:63
> [ 0.029267] Normal zone: 16384 pages in unavailable ranges
>
>
> In this setup, one node ends in the middle of a section (+64MB), the
> other one starts in the middle of the same section (+64MB).
>
> After your patch, the nodes overlap (in one section)
>
> I can spot that each node still has the same number of present pages and
> that each node now has exactly 64MB unavailable pages (the extra ones spanned).
>
> So at least here, it looks like the machinery is still doing the right thing?

So in this setup we'll have pages in the overlapping section initialized twice
and they will end up linked to node1, which is not exactly correct, but we care
less about the nodes than about the zones. Well, at least we don't have
VM_BUG_ON(!node_spans_pfn()) :)
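
To double check the numbers in the log, here is a minimal userspace sketch of
the rounding from the hunk above applied to a node's physical range. It
assumes x86_64 with 4K pages and 128M sections (PAGES_PER_SECTION = 32768).
The struct and helper names are made up for illustration, this is not kernel
code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 values: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27
#define PAGES_PER_SECTION	(UINT64_C(1) << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* Hypothetical helper type: a pfn range with an exclusive end. */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t pfn_round_down(uint64_t pfn)
{
	return pfn & ~(PAGES_PER_SECTION - 1);
}

static uint64_t pfn_round_up(uint64_t pfn)
{
	return (pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/*
 * Take a node's physical range [start_addr, last_addr] (last byte inclusive,
 * as printed in the boot log) and extend it to whole sections.
 */
static struct pfn_range extend_to_sections(uint64_t start_addr,
					   uint64_t last_addr)
{
	struct pfn_range r;

	assert(last_addr >= start_addr);
	r.start = pfn_round_down(start_addr >> PAGE_SHIFT);
	r.end = pfn_round_up((last_addr >> PAGE_SHIFT) + 1);
	return r;
}

/* Number of pfns covered by both ranges, 0 if they are disjoint. */
static uint64_t overlap_pages(struct pfn_range a, struct pfn_range b)
{
	uint64_t lo = a.start > b.start ? a.start : b.start;
	uint64_t hi = a.end < b.end ? a.end : b.end;

	return hi > lo ? hi - lo : 0;
}
```

For node 0 [0x0-0x143ffffff] and node 1 [0x144000000-0x23fffffff] this gives
pfn spans [0x0, 0x148000) and [0x140000, 0x240000), which matches the
"Initmem setup" lines in the next/master log. The spans share 0x8000 pfns,
i.e. exactly one section, and each node really owns only half of it, which is
consistent with the 16384 pages in unavailable ranges reported for the Normal
zone of each node.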

--
Sincerely yours,
Mike.