Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations

From: Oscar Salvador
Date: Tue Apr 08 2025 - 09:59:43 EST


On Wed, Apr 02, 2025 at 08:56:13PM +0000, Frank van der Linden wrote:
> Hugetlb boot allocation has used online nodes for allocation since
> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> allocation"). This was needed to be able to do the allocations
> earlier in boot, before N_MEMORY was set.
>
> This might lead to a different distribution of gigantic hugepages
> across NUMA nodes if there are memoryless nodes in the system.
>
> What happens is that the memoryless nodes are tried, but then
> the memblock allocation fails and falls back, which usually means
> that the node that has the highest physical address available
> will be used (top-down allocation). While this will end up
> getting the same number of hugetlb pages, they might not
> be distributed the same way. The fallback for each memoryless
> node might not end up coming from the same node as the
> successful round-robin allocation from N_MEMORY nodes.
>
> While administrators that rely on having a specific number of
> hugepages per node should use the hugepages=N:X syntax, it's
> better not to change the old behavior for the plain hugepages=N
> case.
>
> To do this, construct a nodemask for hugetlb bootmem purposes
> only, containing nodes that have memory. Then use that
> for round-robin bootmem allocations.
>
> This saves some cycles, and the added advantage here is that
> hugetlb_cma can use it too, avoiding the older issue of
> pointless attempts to create a CMA area for memoryless nodes
> (which will also cause the per-node CMA area size to be too
> small).

Hi Frank,

Makes sense.

There is something I do not quite understand, though:

> @@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
>
> 	for_each_hstate(h) {
> 		h->next_nid_to_alloc = first_online_node;
> -		h->next_nid_to_free = first_online_node;

Why are you unsetting next_nid_to_free? I guess it is because
we do not use it during boot time and you already set it to
first_memory_node further down the road in hugetlb_init_hstates.

And the reason you are leaving next_nid_to_alloc set is to see if
there is any chance that first_online_node is part of hugetlb_bootmem_nodes?
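
To make that second question concrete, here is a minimal userspace
sketch (not kernel code) of what I would expect to happen when the
starting node is not in the mask used for the round-robin. All names
here (nodemask_sim_t, next_node_in_sim, distribute_round_robin) are
made up for illustration, and it assumes the round-robin helper moves
on to the next allowed node when the starting node is not in the mask.

```c
#include <assert.h>

#define MAX_NODES 8

/* Hypothetical stand-in for a nodemask: bit n set means node n is in the mask. */
typedef unsigned int nodemask_sim_t;

/* Next node in 'mask' after 'nid', wrapping around; -1 if the mask is empty. */
static int next_node_in_sim(int nid, nodemask_sim_t mask)
{
	for (int i = 1; i <= MAX_NODES; i++) {
		int cand = (nid + i) % MAX_NODES;

		if (mask & (1u << cand))
			return cand;
	}
	return -1;
}

/*
 * Hand out 'count' pages round-robin over 'mask', starting at 'start'.
 * Assumption: if 'start' is not in the mask, skip ahead to the next
 * node that is, instead of attempting an allocation on 'start'.
 */
static void distribute_round_robin(int start, nodemask_sim_t mask,
				   int count, int per_node[MAX_NODES])
{
	int nid = start;

	assert(start >= 0 && start < MAX_NODES);

	for (int n = 0; n < MAX_NODES; n++)
		per_node[n] = 0;

	if (!mask)
		return;

	if (!(mask & (1u << nid)))
		nid = next_node_in_sim(nid, mask);

	for (int i = 0; i < count; i++) {
		per_node[nid]++;
		nid = next_node_in_sim(nid, mask);
	}
}
```

With nodes 1 and 3 in the mask and node 0 as the (memoryless) starting
node, calling distribute_round_robin(0, (1u << 1) | (1u << 3), 4, per_node)
puts two pages on node 1 and two on node 3 in this sketch, so the
starting value would only matter if it happens to be in the mask.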


--
Oscar Salvador
SUSE Labs