Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

From: Michal Hocko
Date: Thu Jul 02 2020 - 04:41:30 EST


On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@xxxxxxxxxx> [2020-07-01 14:21:10]:
>
> > > >>>>>>
> > > >>>>>> 2. Also, the existence of the dummy node leads to inconsistent information: the
> > > >>>>>> number of online nodes is inconsistent with the information in the
> > > >>>>>> device-tree and the resource dump.
> > > >>>>>>
> > > >>>>>> 3. When the dummy node is present, single-node non-NUMA systems end up showing
> > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This means we take
> > > >>>>>> the hit from unnecessary NUMA hinting faults.
> > > >>>>>
> > > >>>>> I have to say that I dislike the node online/offline state and directly
> > > >>>>> exporting that to userspace. Users should only care whether the node
> > > >>>>> has memory/cpus. NUMA nodes can be online without any memory. Just
> > > >>>>> offline all the present memory blocks but do not physically hot remove
> > > >>>>> them and you are in the same situation. If users are confused by the
> > > >>>>> output of tools like numactl -H, then those tools could be updated to
> > > >>>>> hide nodes without any memory & cpus.
> > > >>>>>
> > > >>>>> The autonuma problem sounds interesting, but again this patch doesn't
> > > >>>>> really solve the underlying problem, because I strongly suspect that the
> > > >>>>> problem is still there when a NUMA node gets all its memory offlined,
> > > >>>>> as mentioned above.
> >
> > I would really appreciate a feedback to these two as well.
>
> 1. It's not just numactl that has to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying not to rely on or
> believe the output given by the kernel but to do further verification?

No, what we are saying is that even an online node might have zero online
pages/cpus. So the online status is not really something that matters. If
people are confused by that output, then user space tools can make that
confusion go away. I really do not understand why the kernel should do any
logic there.
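
To illustrate, a rough userspace sketch (not from the patch; the helper
name and the parsing are made up, the sysfs paths are the standard
per-node layout): a tool can ignore the online bit entirely and just
check whether the node has any cpus or memory:

/* Sketch only: is this node worth showing to the user?  Ignores the
 * online/offline state and looks at the per-node sysfs attributes.
 */
#include <stdio.h>
#include <string.h>

static int node_has_cpus_or_memory(int nid)
{
        char path[128], buf[256];
        unsigned long kb;
        FILE *f;

        /* A non-empty cpulist means the node has at least one CPU. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/cpulist", nid);
        f = fopen(path, "r");
        if (f) {
                int has_cpu = fgets(buf, sizeof(buf), f) && buf[0] != '\n';
                fclose(f);
                if (has_cpu)
                        return 1;
        }

        /* Otherwise look for a non-zero MemTotal in the node's meminfo. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/meminfo", nid);
        f = fopen(path, "r");
        if (!f)
                return 0;
        while (fgets(buf, sizeof(buf), f)) {
                if (sscanf(buf, "Node %*d MemTotal: %lu kB", &kb) == 1) {
                        fclose(f);
                        return kb > 0;
                }
        }
        fclose(f);
        return 0;
}

numactl -H could simply hide nodes for which this returns 0.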

> Also, how would user space differentiate between the case where the
> kernel missed marking a node as offline and the case where the memory was
> offlined on a cpuless node but the node wasn't marked offline?

What I am arguing is that those two shouldn't be any different. Really!
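
For completeness, a sketch of how you end up in exactly that state on any
NUMA machine, no special HW needed (assumes memory hotremove support,
root, and removable blocks; error handling is deliberately minimal):

/* Offline every memory block of a node via sysfs; the node itself
 * stays online even though it ends up with no memory at all.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void offline_node_memory(int nid)
{
        char path[160];
        struct dirent *de;
        DIR *dir;

        snprintf(path, sizeof(path), "/sys/devices/system/node/node%d", nid);
        dir = opendir(path);
        if (!dir)
                return;

        while ((de = readdir(dir))) {
                FILE *f;

                /* Memory blocks appear as memory<N> links under the node. */
                if (strncmp(de->d_name, "memory", 6) || !isdigit(de->d_name[6]))
                        continue;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/%s/state",
                         nid, de->d_name);
                f = fopen(path, "w");
                if (!f)
                        continue;
                fputs("offline", f);
                fclose(f);
        }
        closedir(dir);
}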

> 2. Regarding autonuma, the case of offlined memory is user/admin driven,
> so if there is a performance hit, it's something driven by the user's or
> admin's own actions. Also, how often do we see users offline the complete
> memory of a cpuless node on a 2-node system?

How often do we see crippled HW configurations like that? Really, if
autonuma should be made more clever for one case, it should recognize the
other as well.
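
If somebody wanted to teach autonuma about this, the check would
presumably count nodes that actually have memory rather than nodes that
are merely online. A minimal sketch (numa_balancing_worth_it() is a
made-up name; num_node_state()/N_MEMORY are the existing nodemask
helpers):

/* Sketch only: count nodes with memory, not merely online nodes. */
#include <linux/nodemask.h>

static bool numa_balancing_worth_it(void)
{
        return num_node_state(N_MEMORY) > 1;
}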

> > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >
> > This begs a question whether ppc can do the same thing?
>
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not. As replied in another email in this thread, I would hope for
having fewer hacks in the NUMA initialization. Cleaning up the mess would
be a lot of work and testing on all NUMA-capable architectures. This is a
heritage from the past, I am afraid. All I am arguing here is that touching
the generic code with a very simple-looking patch might have side effects
which are pretty much impossible to review. Moreover, it seems that nothing
but ppc really needs this treatment, so fixing it in ppc-specific code
sounds much safer.
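
For reference, the x86 behaviour in the SRAT dump above comes from handing
out logical node ids first-come-first-served per proximity domain
(acpi_map_pxm_to_node() does something along these lines), so the first
populated domain becomes node 0. The snippet below is only a sketch of
that idea, not the actual code; a ppc-specific fix could do the equivalent
remapping in its own NUMA setup:

/* Sketch only, not the real implementation: the first proximity domain
 * seen in the firmware tables gets logical node 0, the next one node 1,
 * and so on.
 */
#include <stdbool.h>

#define MAX_PXM_DOMAINS 256

static int pxm_to_node_map[MAX_PXM_DOMAINS];
static bool pxm_mapped[MAX_PXM_DOMAINS];
static int nodes_found;

static int map_pxm_to_node(int pxm)
{
        if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
                return -1;
        if (!pxm_mapped[pxm]) {
                pxm_mapped[pxm] = true;
                pxm_to_node_map[pxm] = nodes_found++;
        }
        return pxm_to_node_map[pxm];
}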

Normally I would really push for a generic solution, but after getting
burned several times in this area I do not dare to anymore. The problem is
not the code complexity but how spread out it is, in places where you do
not expect side effects.

--
Michal Hocko
SUSE Labs