Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

From: David Hildenbrand
Date: Fri Jul 03 2020 - 07:32:37 EST


On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This raises the question of whether ppc can do the same thing.
>>> Or could x86 stop doing it, so that you can see which node you are running on?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen <ak@xxxxxxx>
>> Date: Tue Jan 11 15:35:48 2005 -0800
>>
>> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>> Fix fallout from the recent nodemask_t changes. The node ids assigned
>> in the SRAT parser were off by one.
>>
>> I added a new first_unset_node() function to nodemask.h to allocate
>> IDs sanely.
>>
>> Signed-off-by: Andi Kleen <ak@xxxxxxx>
>> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
>>
>> which doesn't really tell us all that much. I suspect this is historical
>> baggage and long-standing behavior that is not trivial to fix.
>
> Thinking about this some more, this logic makes some sense after all,
> especially in a world without memory hotplug, which was very likely the
> case back then. It is much better to have a compact node mask rather
> than a sparse one. After all, node numbers shouldn't really matter as
> long as you have a clear mapping to the HW. I am not sure we export
> that information (except in the kernel ring buffer), though.
>
> Memory hotplug changes that somewhat, because you can hot-remove NUMA
> nodes and thereby make the nodemask sparse, but that is not a common
> case. I am not sure what would happen if a completely new node were
> added and its corresponding node number was already used by a
> renumbered one, though. It would likely conflate the two, I am afraid.
> But I am not sure this is really possible on x86, and the lack of a bug
> report suggests that nobody is doing that, at least.
>

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU:

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[ 50.174435] ------------[ cut here ]------------
[ 50.175436] node 1 was absent from the node_possible_map
[ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[ 50.176844] Modules linked in:
[ 50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[ 50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[ 50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[ 50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[ 50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[ 50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[ 50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[ 50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[ 50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[ 50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[ 50.176853] FS: 0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[ 50.176853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[ 50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.176856] Call Trace:
[ 50.176856] __add_memory+0x33/0x70
[ 50.176857] acpi_memory_device_add+0x132/0x2f2
[ 50.176857] acpi_bus_attach+0xd2/0x200
[ 50.176858] acpi_bus_scan+0x33/0x70
[ 50.176858] acpi_device_hotplug+0x298/0x390
[ 50.176858] acpi_hotplug_work_fn+0x3d/0x50
[ 50.176859] process_one_work+0x1b4/0x370
[ 50.176859] worker_thread+0x53/0x3e0
[ 50.176860] ? process_one_work+0x370/0x370
[ 50.176860] kthread+0x119/0x140
[ 50.176860] ? __kthread_bind_mask+0x60/0x60
[ 50.176861] ret_from_fork+0x22/0x30
[ 50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[ 50.209816] acpi PNP0C80:00: add_memory failed
[ 50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[ 50.211445] acpi PNP0C80:00: Enumeration failure


I recall that we added that check only recently (because of powerpc, if I am not mistaken).
I am not sure why it triggers here.
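
For readers following the log, here is a rough Python model of the check that produces the warning. It is only a sketch, not the kernel code: the function name toy_add_memory_resource, the set-based node_possible_map, and the -EINVAL return value are assumptions chosen for illustration. Only the warning text and the fact that the add fails come from the log above.

```python
import errno

# Assumption: only node 0 is marked possible in this guest, which is
# what the warning in the log implies for node 1.
node_possible_map = {0}


def toy_add_memory_resource(nid):
    """Toy stand-in for the node check at the start of memory hot-add.

    Returns 0 on success or a negative errno on failure. The exact
    error code is an assumption; the log only shows that the add failed.
    """
    if nid not in node_possible_map:
        print(f"node {nid} was absent from the node_possible_map")
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    # Hot-adding to node 1 is rejected; node 0 would be accepted.
    print(toy_add_memory_resource(1))
    print(toy_add_memory_resource(0))
```

In this model the hot-add is refused before any memory is touched, which matches the sequence of "add_memory failed" and "Enumeration failure" messages in the guest.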

But it properly maps PXM 0 to node 1.
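
The mapping behaviour can be illustrated with a small Python model of "hand out the first unset node ID". This is a sketch under assumptions, not the ACPI code itself: the class PxmNodeMap, its method names, and the MAX_NUMNODES value are hypothetical. The idea of allocating the first unset node comes from the first_unset_node() helper mentioned in the quoted commit message.

```python
MAX_NUMNODES = 4  # hypothetical limit, for illustration only


class PxmNodeMap:
    """Toy model: each new PXM gets the lowest node ID not yet in use."""

    def __init__(self):
        self.pxm_to_node = {}     # PXM -> node ID
        self.nodes_found = set()  # node IDs already handed out

    def first_unset_node(self):
        # Lowest node ID that has not been assigned yet, or None if full.
        for nid in range(MAX_NUMNODES):
            if nid not in self.nodes_found:
                return nid
        return None

    def map_pxm(self, pxm):
        # Reuse an existing mapping so that a PXM keeps its node ID.
        if pxm in self.pxm_to_node:
            return self.pxm_to_node[pxm]
        nid = self.first_unset_node()
        if nid is None:
            return None
        self.pxm_to_node[pxm] = nid
        self.nodes_found.add(nid)
        return nid


if __name__ == "__main__":
    m = PxmNodeMap()
    print(m.map_pxm(1))  # PXM 1 is seen first at boot
    print(m.map_pxm(0))  # PXM 0 appears later via hotplug
```

With PXM 1 seen at boot and PXM 0 seen only at hotplug time, the model assigns node 0 to PXM 1 and node 1 to PXM 0, and no two PXMs end up sharing a node ID.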

--
Thanks,

David / dhildenb