Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Shailabh Nagar
Date: Sun Aug 08 2004 - 15:10:32 EST

Next message: Shailabh Nagar: "Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement"
Previous message: James Courtier-Dutton: "/dev problems with boot: unable to open an initial console."
Next in thread: Shailabh Nagar: "Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Paul Jackson wrote:

Erich Focht wrote:

we (NEC) are also a potential user of this patch

Good - welcome.

I think cpusets and CKRM should be
made to come together. One of CKRM's user interfaces is a filesystem
with the file-tree representing the class hierarchy. It's the same for
cpusets.

Hmmm ... this suggestion worries me, for a couple of reasons.

Just because cpusets and CKRM both have a hierarchy represented in a
file system doesn't mean it is, or can be, the same file system. Not
all trees are the same.

Perhaps someone more expert in CKRM can help here. The cpuset hierarchy
has some strict semantics:
1) Any cpusets CPUs and Memory must be a subset of its parents.
2) A cpuset may be exclusive for CPU or Memory only if its parent is.
3) A CPU or Memory exclusive cpuset may not overlap its siblings.

See the routine kernel/cpuset.c:validate_change() for the exact
coding of these rules.

If we followed your suggestion, Erich, would these rules still hold?
I can't imagine that the CKRM folks have any existing hierarchies with
these particular rules. They would need to if we went this way.

As CKRM stands today, we wouldn't be able to impose these constraints for exactly the reasons you point out. The other controllers would not forbid the move of a task violating the above rules to a CKRM class but this controller (CKRM's version of cpusets) would. Currently, on a task move, CKRM's core calls per-controller callbacks so the controller can make modifications to the controller-specific per-class objects. But controllers can't prevent such a move.

However, one of the CKRM changes suggested in the Kernel Summit was to split up the controllers and not have them bundled within a "core" class as we call it. In this model, each task would directly belong to some controller-specific class.

If CKRM were to adopt this change, one *potential* (but not necessary) consequence, is to have multiple hierarchies, one per-controller, exposed to the user e.g. instead of /rcfs/taskclass/<sameclasstree>, we would have /rcfs/cpu/<oneclasstree> and /rcfs/mem/<anotherclasstree> etc.

In such a scenario, it would be more logical for the controller to constrain memberships (i.e. task moves, class share setting while it is part of a hierarchy etc.) and it would be easy for cpusets to get its semantics.

On the flip side, what additional rules, if any, would CKRM impose
on this hierarchy?

Currently, we impose rules on the shares that one can set (child cannot have more than its parent, sibling shares should add up etc.) and we'd
discussed, but not implemented yet, some limit on how deep the common hierarchy would go.

The other reason that this suggestion worries me is a bit more
philosophical. I'm sure that for all the other, well known,
resources that CKRM manages, no one is proposing replacing whatever
existing names and mechanisms exist for those resources, such as
bandwidth, compute cycles, memory, ... Rather I presume that CKRM
provides an additional resource management layer on top of the
existing resources, which retain their classic names and apparatus.

What you seem to be suggesting here, especially with this nice
picture from your next post:

The files in cpusets are:
- cpus: list of CPUs in that cpuset
- mems: list of Memory Nodes in that cpuset
- cpu_exclusive flag: is cpu placement exclusive?
- mem_exclusive flag: is memory placement exclusive?
- tasks: list of tasks (by pid) attached to that cpuset
The files in a CKRM class directory:
- stats : statistics (not needed for cpusets)
- shares : could contain cpus, mems, cpu_exclusive, mem_exclusive
- members : same as reading /dev/cpusets/.../tasks
- target : same as writing /dev/cpusets/.../tasks

Changing the "shares" would mean something like
echo "cpus +6-10" > .../shares

would remove the cpuset specific interface forever, leaving it only
visible via a more generic "shares, members, target" interface suitable
for abstract resource management.

I am afraid that this would make it harder for new users of cpusets to
figure them out. Just cpusets by themselves add a new and strange layer
of abstraction, that will require a little bit of head scratching (as
Martin Bligh can testify to, from recent experience ;) for those
administering and managing the big iron where cpusets will be useful.

To add yet another layer of abstractions on top of that, from the CKRM
world, might send quite a few users into mental overload, doing the
usual stupid things we all do when we have given up on understanding and
are just thrashing about, trying to get something to work.

I think we are onto something useful here, the hierarchical organizing
of compute resources of CPU and Memory, which will become increasingly
relevant in the coming years, with bigger machines and more complex
compute and memory architectures.

I'd hate to see cpusets hidden behind resource management terms from day
one.

Yup, thats a valid concern. In this current round of CKRM redesign, we're considering whether controllers should be allowed to export their own interface (in a sense) by accepting different kinds of share settings. That is already true today in case of the "stats" and "config" virtual files which don't have any CKRM-imposed semantics. Only "shares" has a CKRM-defined set of values defined, not all of which are useful or will be implemented by a controller. We're debating whether to make that one controller-dependent too. If that happens, it'll make it somewhat better for cpusets. But I'm not sure if we'd want to go so far as to allow controllers to define what virtual files they export......we do that today for the classification engine because it is an entirely different beast but the controllers are similar.....

And, looking at it from the CKRM side (not sure I can, I'll try ...)
would it not seem a bit odd to a CKRM user that just one of the resource
types managed, these cpusets, had no apparent existence outside of the
CKRM hierarchy, unlike all the other resources, which existed a priori,
and, I presume, continue their independent existance?

From just the viewpoint of cpusets (not adding mem), it seems to be quite similar to what CKRM's other controllers are doing - grouping a per-task control (in your case, sched_setaffinity) using hierarchical sets.

Obviously, I could use a little CKRM expertise here.

But my inclination is to continue to view these two projects as separate,
with the potential that CKRM will someday add cpusets to the resource types
that it can manage.

Umm... I'm quite sure you mean , you'll contribute code to do that, right ? :-)

It looks like the interface issue is the main one from both projects' pov. Hopefully things will become clearer in the next week or so when ckrm-tech thrashes out the Kernel Summit suggestion (it has other ramifications besides interface).

-- Shailabh

Thank-you.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Shailabh Nagar: "Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement"
Previous message: James Courtier-Dutton: "/dev problems with boot: unable to open an initial console."
Next in thread: Shailabh Nagar: "Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]