Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue

From: Nick Piggin
Date: Sun Sep 05 2004 - 21:52:32 EST

Next message: Nick Piggin: "Re: [PATCH] [ppc64] Allow SD_NODES_PER_DOMAIN to be overridden"
Previous message: Jon Smirl: "Re: Intel ICH - sound/pci/intel8x0.c"
In reply to: James Bottomley: "Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue"

James Bottomley wrote:

> Well this patch got in, which is what I want, since it allows the
> non-NUMA machines to work with hotplug CPUs again. However, is anyone
> actually looking to fix this for real?

I think someone else (tm) is looking at it :)
Some of the IBM hotplug guys I think.

> The fundamental problem is that NUMA or the scheduler (or both) are
> broken with regard to hotplug.
>
> The origin of the breakage is the differences between cpu_possible_map
> and cpu_online_map. In hotplug CPU, there are two ways to do
> initialisations: you can initialise from cpu_online_map, but then you
> *must* have a cpu hotplug notify listener to add data structures for the
> extra CPUs as they come on-line, or you can initialise from
> cpu_possible_map and not bother with a notifier. The disadvantage of
> the latter is that cpu_possible_map may be vastly larger than
> cpu_online_map ever gets to, thus wasting valuable kernel memory.
>
> The scheduler code is schizophrenic in this regard in that it does both:
> it initialises static data structures from cpu_possible_map, but it also
> has a hotplug cpu listener for starting things like the migration
> threads.
>
> I suspect the NUMA people would like us all to go to the former method
> (initialise only from cpu_online_map and have a proper hotplug listener)
> since their possible maps are pretty huge. However, which is it to be:
> fix NUMA (to have two cpu_to_node() maps for the possible and online
> cpus per node) or fix the scheduler to do initialisation correctly?
>
> Perhaps this should be phased: change NUMA first temporarily for phase
> one and then fix the scheduler (and everyone else initialising from
> cpu_possible_map) in the second.
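
As an illustration of the trade-off discussed above (initialise from
cpu_possible_map with no notifier, or initialise from cpu_online_map and add
state from a hotplug callback), here is a minimal user-space sketch in C. It
is a model only, not kernel code: cpumask_model_t, struct percpu_model,
init_from_mask() and cpu_up_callback() are hypothetical names invented for
this example, and a 64-bit word stands in for a real cpumask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS    64
#define STATE_BYTES 64   /* arbitrary size of the per-CPU state in this model */

/* One bit per CPU; a hypothetical stand-in for a kernel cpumask. */
typedef uint64_t cpumask_model_t;

/* Hypothetical per-CPU bookkeeping: one allocation per initialised CPU. */
struct percpu_model {
    void  *data[MAX_CPUS];
    size_t allocated;        /* number of CPUs that currently own state */
};

/* Allocate state for one CPU if it has none yet. Returns 0 on success. */
static int alloc_cpu(struct percpu_model *p, unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    if (p->data[cpu])
        return 0;
    p->data[cpu] = calloc(1, STATE_BYTES);
    if (!p->data[cpu])
        return -1;
    p->allocated++;
    return 0;
}

/*
 * Allocate state for every CPU set in 'mask'.
 * Passing the possible map models "initialise everything up front";
 * passing the online map models the boot-time half of the notifier approach.
 */
static int init_from_mask(struct percpu_model *p, cpumask_model_t mask)
{
    for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++) {
        if ((mask & (1ULL << cpu)) && alloc_cpu(p, cpu))
            return -1;
    }
    return 0;
}

/*
 * Models the hotplug listener that the online-map approach requires:
 * state is added for the new CPU before it is marked online.
 */
static int cpu_up_callback(struct percpu_model *p, cpumask_model_t *online,
                           unsigned int cpu)
{
    if (alloc_cpu(p, cpu))
        return -1;
    *online |= 1ULL << cpu;
    return 0;
}
```

In this model, calling init_from_mask() with a 32-CPU possible map allocates
32 entries even if only two CPUs ever come up, while the online-map variant
allocates two and relies on cpu_up_callback() for every later arrival.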

The scheduler *should* be able to be fixed nicely by using cpu_online_map
everywhere, and basically undoing then redoing the domains setup before and
after the hotplug, respectively.

So you'd re-attach the dummy domain to all CPUs, do the hotplug operation,
then set up the domains again and re-attach them.
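
A rough user-space model of that sequence, to make the ordering concrete. This
is only a sketch under simplifying assumptions: a single flat domain instead
of a real domain hierarchy, no locking, and every name in it (struct
sched_model, attach_all(), build_domains(), hotplug_cpu()) is hypothetical
rather than taken from the scheduler source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 64

/* One bit per CPU; a hypothetical stand-in for a kernel cpumask. */
typedef uint64_t cpumask_model_t;

/* A domain is reduced to the set of CPUs it spans. */
struct domain_model {
    cpumask_model_t span;
};

struct sched_model {
    cpumask_model_t      online;
    struct domain_model  dummy;               /* empty span: nothing to balance */
    struct domain_model  real;                /* rebuilt from 'online' each time */
    struct domain_model *attached[MAX_CPUS];  /* domain each CPU currently uses */
};

/* Point every online CPU at 'd'; offline CPUs get no domain at all. */
static void attach_all(struct sched_model *s, struct domain_model *d)
{
    for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++)
        s->attached[cpu] = (s->online & (1ULL << cpu)) ? d : NULL;
}

/* Build the real domain from the online map only, then attach it. */
static void build_domains(struct sched_model *s)
{
    s->real.span = s->online;
    attach_all(s, &s->real);
}

/* Bring 'cpu' up (up != 0) or take it down (up == 0). */
static void hotplug_cpu(struct sched_model *s, unsigned int cpu, int up)
{
    assert(cpu < MAX_CPUS);

    /* 1. Re-attach the dummy domain so nothing uses the old setup. */
    attach_all(s, &s->dummy);

    /* 2. The hotplug operation itself, reduced here to a bit flip. */
    if (up)
        s->online |= 1ULL << cpu;
    else
        s->online &= ~(1ULL << cpu);

    /* 3. Set up the domains again from the new online map and re-attach. */
    build_domains(s);
}
```

Because the span is rebuilt from the online map on every hotplug event, it
never contains an offline CPU in this model, which is why users of the span
would no longer need to mask it against cpu_online_map.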

This whole sequence could be pretty expensive, but I don't think the hotplug
guys care. It would allow us to get rid of cpus_and(... cpu_online_map) from a
lot of places in the scheduler too, which would be nice.

The actual code to do it shouldn't be more than a few lines (but I could be
overlooking something).
