Well, this patch got in, which is what I wanted, since it lets
non-NUMA machines work with hotplug CPUs again. However, is anyone
actually looking to fix this for real?
The fundamental problem is that NUMA or the scheduler (or both) are
broken with regard to hotplug.
The origin of the breakage is the difference between cpu_possible_map
and cpu_online_map. With hotplug CPU, there are two ways to do
initialisations: you can initialise from cpu_online_map, but then you
*must* have a cpu hotplug notify listener to add data structures for the
extra CPUs as they come on-line, or you can initialise from
cpu_possible_map and not bother with a notifier. The disadvantage of
the latter is that cpu_possible_map may be vastly larger than
cpu_online_map ever becomes, thus wasting valuable kernel memory.
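
To make the trade-off concrete, here is a rough userspace model of the
two methods. It is only a sketch, not kernel code: the 64-bit masks
stand in for cpu_possible_map and cpu_online_map, and percpu_data,
alloc_cpu(), cpu_up_notify() and the other helpers are names made up
for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 64

/* Userspace stand-ins for cpu_possible_map and cpu_online_map. */
static uint64_t possible_map;
static uint64_t online_map;

/* Hypothetical per-CPU structure a subsystem needs for each CPU it serves. */
struct percpu_data {
    int ready;
};

static struct percpu_data *percpu[NR_CPUS];

static int cpu_isset(int cpu, uint64_t map)
{
    return (int)((map >> cpu) & 1);
}

/* Allocate the per-CPU structure once; returns 0 on success, -1 on failure. */
static int alloc_cpu(int cpu)
{
    if (percpu[cpu])
        return 0;
    percpu[cpu] = calloc(1, sizeof(*percpu[cpu]));
    if (!percpu[cpu])
        return -1;
    percpu[cpu]->ready = 1;
    return 0;
}

/* Method 1: set up only the CPUs that are online right now. */
static int init_from_online(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_isset(cpu, online_map) && alloc_cpu(cpu) != 0)
            return -1;
    return 0;
}

/* Method 1 also needs this: the listener run when a CPU comes on-line. */
static int cpu_up_notify(int cpu)
{
    if (cpu < 0 || cpu >= NR_CPUS || !cpu_isset(cpu, possible_map))
        return -1;
    if (alloc_cpu(cpu) != 0)
        return -1;
    online_map |= (uint64_t)1 << cpu;
    return 0;
}

/* Method 2: set up every possible CPU up front; no listener required. */
static int init_from_possible(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_isset(cpu, possible_map) && alloc_cpu(cpu) != 0)
            return -1;
    return 0;
}

/* Number of per-CPU structures currently allocated. */
static int count_allocated(void)
{
    int n = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (percpu[cpu])
            n++;
    return n;
}

/* Free everything so the other method can be tried from a clean state. */
static void release_all(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        free(percpu[cpu]);
        percpu[cpu] = NULL;
    }
}
```

With eight possible CPUs and two online, method 1 allocates two
structures and grows by one per cpu_up_notify() call, while method 2
allocates all eight immediately whether or not the other six ever
appear.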
The scheduler code is inconsistent in this regard in that it does both:
it initialises static data structures from cpu_possible_map, but it also
has a hotplug cpu listener for starting things like the migration
threads.
I suspect the NUMA people would like us all to go to the former method
(initialise only from cpu_online_map and have a proper hotplug listener)
since their possible maps are pretty huge. However, which is it to be:
fix NUMA (to keep two cpu_to_node() maps, one for the possible and one
for the online CPUs of each node) or fix the scheduler to do
initialisation correctly?
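
For the NUMA option, something along these lines is what I have in
mind. Again this is a userspace sketch under my own assumptions, not
the existing NUMA code: cpu_node[], node_possible[], node_online[] and
the three helpers are hypothetical names, and a 64-bit word stands in
for a cpumask.

```c
#include <stdint.h>

#define NR_CPUS   64
#define MAX_NODES 4

/* Hypothetical cpu -> node table, filled in once at boot. */
static int cpu_node[NR_CPUS];

/* Two masks per node: every CPU it could hold, and those online now. */
static uint64_t node_possible[MAX_NODES];
static uint64_t node_online[MAX_NODES];

/* Boot time: record that a possible CPU belongs to a node. */
static void node_register_cpu(int cpu, int node)
{
    cpu_node[cpu] = node;
    node_possible[node] |= (uint64_t)1 << cpu;
}

/* Hotplug: only the online mask changes as CPUs come and go. */
static void node_cpu_online(int cpu)
{
    node_online[cpu_node[cpu]] |= (uint64_t)1 << cpu;
}

static void node_cpu_offline(int cpu)
{
    node_online[cpu_node[cpu]] &= ~((uint64_t)1 << cpu);
}
```

Code that initialises from the possible map would consult
node_possible[], and code that only cares about running CPUs would
consult node_online[], so the two views can no longer disagree
silently.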
Perhaps this should be phased: change NUMA as a temporary measure in
phase one, then fix the scheduler (and everyone else initialising from
cpu_possible_map) in phase two.