Re: exclusive cpusets broken with cpu hotplug

From: Paul Jackson
Date: Thu Oct 19 2006 - 03:34:16 EST


> So that depends on what cpusets asks for. If, when setting up a and
> b, it asks to partition the domains, then yes, the parent cpuset
> gets broken.

That probably makes good sense from the sched domain side of things.

It is insanely counterintuitive from the cpuset side of things.

Using hierarchical cpuset properties to drive this is the wrong
approach.

In the general case, looking at it (as best I can) from the sched
domain side of things, it seems that the sched domain could be
defined on a system as follows.

Partition the CPUs on the system - into one or more subsets
(partitions), non-overlapping, and covering.

Each of those partitions can either have a sched domain set up on
it, to support load balancing across the CPUs in that partition,
or can be isolated, with no load balancing occurring within that
partition.

No load balancing occurs across partitions.
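
To make that definition concrete, here is a minimal sketch in Python.
It is not kernel code; check_partition(), may_balance() and the 8-CPU
layout are hypothetical names and values chosen only for illustration.
Each partition is modelled as a (cpus, balanced) pair.

```python
def check_partition(all_cpus, partitions):
    """Return True if the partitions are non-overlapping and covering.

    partitions is a list of (cpus, balanced) pairs: cpus is a set of CPU
    numbers, balanced says whether a sched domain is set up on that set.
    """
    seen = set()
    for cpus, _balanced in partitions:
        if seen & cpus:            # overlaps an earlier partition
            return False
        seen |= cpus
    return seen == set(all_cpus)   # every CPU is in some partition


def may_balance(partitions, cpu_a, cpu_b):
    """True if tasks may be load balanced between cpu_a and cpu_b."""
    for cpus, balanced in partitions:
        if cpu_a in cpus and cpu_b in cpus:
            return balanced        # same partition: depends on its kind
    return False                   # different partitions: never


# Hypothetical system: two balanced partitions and one isolated one.
example = [({0, 1, 2, 3}, True), ({4, 5}, True), ({6, 7}, False)]
```

With this layout, may_balance(example, 0, 3) holds, while CPUs 3 and 4
(different partitions) and CPUs 6 and 7 (isolated partition) are never
balanced against each other.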

Using cpu_exclusive cpusets for this is next to impossible. It could
be approximated perhaps by having just the immediate children of the
root cpuset, /dev/cpuset/*, define the partition.

But if any lower level cpusets have any effect on the partitioning,
by setting their cpu_exclusive flag in the current implementation,
it is -always- the case, by the basic structure of the cpuset
hierarchy, that the lower level cpuset is a subset of its parent's
cpus, and that that parent also has cpu_exclusive set.
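
A rough Python model of that rule, as a sketch only: the Cpuset class,
forced_ancestors() and the three-level hierarchy are hypothetical and
merely mirror the two constraints stated above (a child's cpus are a
subset of its parent's, and a cpu_exclusive child implies a
cpu_exclusive parent).

```python
class Cpuset:
    def __init__(self, name, cpus, parent=None):
        self.name = name
        self.cpus = set(cpus)
        self.parent = parent
        # A child's CPUs must be a subset of its parent's CPUs.
        if parent is not None and not self.cpus <= parent.cpus:
            raise ValueError(f"{name}: cpus not a subset of parent's")


def forced_ancestors(cs):
    """Names of the ancestors that must also be cpu_exclusive if cs is."""
    names = []
    node = cs.parent
    while node is not None:        # walk up to the root cpuset
        names.append(node.name)
        node = node.parent
    return names


# Hypothetical hierarchy: /dev/cpuset -> a -> a1
root = Cpuset("root", range(8))
a = Cpuset("a", range(4), parent=root)
a1 = Cpuset("a1", range(2), parent=a)
```

Marking only a1 cpu_exclusive drags in every cpuset on the path to the
root: forced_ancestors(a1) gives ["a", "root"].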

The resulting partitioning, even in such simple examples as above, is
not obvious. If you look back a couple days, when I first presented
essentially this example, I got the resulting sched domain partitioning
entirely wrong.

The essential detail in my big patch of yesterday, to add new specific
sched_domain flags to cpusets, is that it -removed- the requirement to
mark a parent as defining a sched domain anytime a child defined one.

That requirement is one of the defining properties of the cpu_exclusive
flag, and makes that flag -outrageously- unsuited for defining sched
domain partitions.

My new sched_domain flags at least had the right properties, defaults
and rules, that they perhaps could have been used to sanely define sched
domain partitions. One could mark a few select cpusets, at any depth
in the hierarchy, as defining sched domain partitions, without being
forced to mark a whole bunch more ancestor cpusets the same way,
slicing and dicing the sched domain partitions into hamburger.

However, fortunately, ... so far as I can tell ... no one needs the
general case described above, of multiple sched domain partitions.

So far as I know, the only essential special case that user land
needs to deal with is to isolate one partition (one subset of CPUs)
from any scheduler load balancing. Every CPU is either load
balanced however the kernel/sched.c chooses to load balance it,
with potentially every other non-isolated CPU, or is in the isolated
partition (cpu_isolated_map) and not considered for load balancing.
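
Sketched in Python, under the assumption that a plain set of CPU numbers
stands in for cpu_isolated_map (balance_candidates() is a hypothetical
helper, not anything in kernel/sched.c), that special case reduces to:

```python
def balance_candidates(online_cpus, isolated_cpus, cpu):
    """CPUs that `cpu` could be load balanced with in this model."""
    if cpu in isolated_cpus:
        return set()               # isolated CPUs are never balanced
    # Potentially every other non-isolated CPU.
    return set(online_cpus) - set(isolated_cpus) - {cpu}
```

For example, with four CPUs and CPU 3 isolated, CPU 0 may balance with
CPUs 1 and 2, and CPU 3 balances with nothing.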

Have I missed any case requiring explicit user intervention?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401