Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

From: Michal Hocko
Date: Fri Jul 03 2020 - 07:46:09 EST


On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> >>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> >>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >>>>
> >>>> This raises the question of whether ppc can do the same thing.
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen <ak@xxxxxxx>
> >> Date: Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen <ak@xxxxxxx>
> >> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
> >>
> >> which doesn't really tell all that much. I suspect this is historical
> >> baggage and long-term behavior that is not really trivial to fix.
> >
> > Thinking about this some more, this logic makes some sense after all.
> > Especially in a world without memory hotplug, which was very likely the
> > case back then. It is much better to have a compact node mask rather than
> > a sparse one. After all, node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> >
> > Memory hotplug changes that somewhat because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> >
>
> I think the ACPI code takes care of properly mapping PXM to nodes.
>
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
>
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
>
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
>
> [ 50.174435] ------------[ cut here ]------------
> [ 50.175436] node 1 was absent from the node_possible_map
> [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290

This would mean that the ACPI code, or whoever does the remapping, is not
adding the new node to the possible nodes.
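
To illustrate the suspected failure mode, here is a rough userspace
sketch. It is a toy model, not the actual ACPI/SRAT code: every name,
array size and helper below is made up for the example, and the local
first_unset_node() is only a simplified stand-in for the nodemask.h
helper mentioned in the commit above. It assumes that PXMs get the first
free node id on first use and that the possible map is frozen once boot
is done.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sizes for the sketch, not the kernel's limits. */
#define MAX_PXM   8
#define MAX_NODES 8
#define NO_NODE   (-1)

static int pxm_to_node_map[MAX_PXM];  /* PXM -> node id, NO_NODE if unmapped */
static unsigned int nodes_found;      /* bitmask of node ids handed out so far */
static unsigned int nodes_possible;   /* bitmask frozen at the end of "boot" */

/* Forget all mappings so the model starts from a clean state. */
static void sketch_reset(void)
{
	for (int i = 0; i < MAX_PXM; i++)
		pxm_to_node_map[i] = NO_NODE;
	nodes_found = 0;
	nodes_possible = 0;
}

/* Lowest node id not yet set in the mask, or NO_NODE if the mask is full. */
static int first_unset_node(unsigned int mask)
{
	for (int n = 0; n < MAX_NODES; n++)
		if (!(mask & (1u << n)))
			return n;
	return NO_NODE;
}

/* Return the node for a PXM, allocating a compact id on first use. */
static int map_pxm_to_node(int pxm)
{
	assert(pxm >= 0 && pxm < MAX_PXM);

	int node = pxm_to_node_map[pxm];
	if (node == NO_NODE) {
		node = first_unset_node(nodes_found);
		if (node == NO_NODE)
			return NO_NODE;
		pxm_to_node_map[pxm] = node;
		nodes_found |= 1u << node;
	}
	return node;
}

/* Assumption: the possible map is a snapshot of what boot has seen. */
static void finish_boot(void)
{
	nodes_possible = nodes_found;
}

/* 1 if the node backing this PXM is in the possible map, 0 otherwise. */
static int hotplug_node_is_possible(int pxm)
{
	int node = map_pxm_to_node(pxm);

	return node != NO_NODE && (nodes_possible & (1u << node));
}
```

In this model, calling sketch_reset(), then map_pxm_to_node(1) and
finish_boot() reproduces the boot state quoted above (PXM 1 == node 0).
A later hotplug_node_is_possible(0) hands PXM 0 the next free id, node 1,
which was never put into the possible map, so the check fails. That is
the same shape as the "node 1 was absent" warning.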

[...]
> I remember that we added that check just recently (due to powerpc if I am not wrong).
> Not sure why that triggers here.

IIRC this was a misbehaving QEMU providing a garbage map.

--
Michal Hocko
SUSE Labs