Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

From: Vlastimil Babka
Date: Thu Mar 12 2020 - 12:42:09 EST


On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@xxxxxxx> [2020-03-12 14:51:38]:
>
>> > * Vlastimil Babka <vbabka@xxxxxxx> [2020-03-12 10:30:50]:
>> >
>> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> wrote:
>> >> >> * Michal Hocko <mhocko@xxxxxxxxxx> [2020-03-11 12:57:35]:
>> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >> >>>> proper node.
>> >> >>>
>> >> >>> Just curious, is this somehow related to
>> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@xxxxxxxxxxxxxx?
>> >> >>>
>> >> >>
>> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> >> shrinker_map on appropriate NUMA node").
>> >> >>
>> >
>> > While I am not an expert in the slub area, I looked at the patch
>> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
>> >
>> > On the system where the crash happens, the possible number of nodes is much
>> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
>> > available for onlined nodes.
>> >
>> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
>> > for all possible nodes and in ___slab_alloc we end up looking at the
>> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
>> > i.e for a node whose pdgat struct is not allocated, we are trying to
>> > dereference.
>>
>> From what we saw, the pgdat does exist, the problem is that slab's per-node data
>> doesn't exist for a node that doesn't have present pages, as it would be a waste
>> of memory.
>
> Just to be clear
> Before my 3 patches to fix dummy node:
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 0-1

OK

>>
>> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
>> Sachin's first report [1] we have
>>
>> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> [ 0.000000] numa: NODE_DATA(0) on node 1
>> [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
>>
>
> So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> rest 30 nodes.

I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
check online first, as you suggested.

>> But in this thread, with your patches Sachin reports:
>
> and with my patches
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 1
>
>>
>> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>>
>
> so we only see one pgdat.
>
>> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
>>
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@xxxxxxxxxxxxxxxxxx/
>>
>> > Also for a memoryless/cpuless node or possible but not present nodes,
>> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
>>
>> I think that's the place where this would be best to fix.
>>
>
> Maybe. I thought about it but the current set_numa_mem semantics are apt
> for memoryless cpu node and not for possible nodes. We could have upto 256
> possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> node_to_mem_node seems to return what is set in set_numa_mem().
> set_numa_mem() seems to say set my numa_mem node for the current memoryless
> node to the param passed.
>
> But how do we set numa_mem for all the other 253 possible nodes, which
> probably will have 0 as default?
>
> Should we introduce another API such that we could update for all possible
> nodes?

If we want to rely on node_to_mem_node() to give us something safe for each
possible node, then probably it would have to be like that, yeah.

>> > I tried with this hunk below and it works.
>> >
>> > But I am not sure if we need to check at other places were
>> > node_present_pages is being called.
>>
>> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
>> return only nodes that are online with present memory?
>> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
>> support for node_to_mem_node() to determine the fallback node")
>>
>
> Agree
>
>> I think we do need well defined and documented rules around node_to_mem_node(),
>> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
>> everyone handles it the same, safe way.

So let's try to brainstorm how this would look like? What I mean are some rules
like below, even if some details in my current understanding are most likely
incorrect:

with nid present in:
N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
node with memory so that we don't require everyone to search for it in slightly
different ways
N_ONLINE - pgdat must exist, there doesn't have to be present memory,
node_to_mem_node() still has to return something else (?)
N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
N_HIGH_MEMORY - node has present high memory

>
> Other option would be to tweak Kirill Tkhai's patch such that we call
> kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> if the node is offline.

I really would like a solution that hides these ugly details from callers so
they don't have to workaround the APIs we provide. kvmalloc_node() really
shouldn't crash, and it should fallback automatically if we don't give it
__GFP_THISNODE

However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
wasteful on systems with 256 possible nodes and only few present, by allocating
effectively dead structures for each memcg.

SLUB tries to be smart, so it allocates the per-node per-cache structures only
when the node goes online in slab_mem_going_online_callback(). This is why
there's a crash when such non-existing structures are accessed for a node that's
not online, and why they shouldn't be accessed.

Perhaps memcg should do the same on-demand allocation, if possible.

>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 626cbcbd977f..bddb93bed55e 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> > if (unlikely(!node_match(page, node))) {
>> > int searchnode = node;
>> >
>> > - if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > - searchnode = node_to_mem_node(node);
>> > -
>> > + if (node != NUMA_NO_NODE) {
>> > + if (!node_online(node) || !node_present_pages(node)) {
>> > + searchnode = node_to_mem_node(node);
>> > + if (!node_online(searchnode))
>> > + searchnode = first_online_node;
>> > + }
>> > + }
>> > if (unlikely(!node_match(page, searchnode))) {
>> > stat(s, ALLOC_NODE_MISMATCH);
>> > deactivate_slab(s, page, c->freelist, c);
>