[RFC] cgroup TODOs

From: Tejun Heo
Date: Thu Sep 13 2012 - 16:58:31 EST

Next message: David Ahern: "[PATCH 0/3 v3] perf: precise mode and exclude_guest"
Previous message: Rafael Aquini: "Re: [PATCH -mm] enable CONFIG_COMPACTION by default"
Next in thread: Mike Galbraith: "Re: [RFC] cgroup TODOs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello, guys.

Here's the write-up I promised last week about what I think are the
problems in cgroup and what the current plans are.

First of all, it's a mess. Shame on me. Shame on you. Shame on all
of us for allowing this mess. Let's all tremble in shame for solid
ten seconds before proceeding.

I'll list the issues I currently see with cgroup (easier ones first).
I think I now have at least tentative plans for all of them and will
list them together with the prospective asignees (my wish mostly).
Unfortunately, some of the plans involve userland visible changes
which would at least cause some discomfort and require adjustments on
their part.

1. cpu and cpuacct

They cover the same resources and the scheduler cgroup code ends up
having to traverse two separate cgroup trees to update the stats.
With nested cgroups, the overhead isn't insignificant and it
generally is silly.

While the use cases for having cpuacct on a separate and likely more
granular hierarchy, are somewhat valid, the consensus seems that
it's just not worth the trouble and cpuacct should be removed in the
long term and we shouldn't allow overlapping controllers for the
same resource, especially accounting ones.

Solution:

* Whine if cpuacct is not co-mounted with cpu.

* Make sure cpu has all the stats of cpuacct. If cpu and cpuacct
are comounted, don't really mount cpuacct but tell cpu that the
user requested it. cpu is updated to create aliases for cpuacct.*
files in such cases. This involves special casing cpuacct in
cgroup core but I much prefer one-off exception case to adding a
generic mechanism for this.

* After a while, we can just remove cpuacct completely.

* Later on, phase out the aliases too.

Who:

Me, working on it.

2. memcg's __DEPRECATED_clear_css_refs

This is a remnant of another weird design decision of requiring
synchronous draining of refcnts on cgroup removal and allowing
subsystems to veto cgroup removal - what's the userspace supposed to
do afterwards? Note that this also hinders co-mounting different
controllers.

The behavior could be useful for development and debugging but it
unnecessarily interlocks userland visible behavior with in-kernel
implementation details. To me, it seems outright wrong (either
implement proper severing semantics in the controller or do full
refcnting) and disallows, for example, lazy drain of caching refs.
Also, it complicates the removal path with try / commit / revert
logic which has never been fully correct since the beginning.

Currently, the only left user is memcg.

Solution:

* Update memcg->pre_destroy() such that it never fails.

* Drop __DEPRECATED_clear_css_refs and all related logic.
Convert pre_destroy() to return void.

Who:

KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs
trigger WARN sooner or later. Let's please get this settled.

3. cgroup_mutex usage outside cgroup core

This is another thing which is simply broken. Given the way cgroup
is structured and used, nesting cgroup_mutex inside any other
commonly used lock simply doesn't work - it's held while invoking
controller callbacks which then interact and synchronize with
various core subsystems.

There are currently three external cgroup_mutex users - cpuset,
memcontrol and cgroup_freezer.

Solution:

Well, we should just stop doing it - use a separate nested lock
(which seems possible for cgroup_freezer) or track and mange task
in/egress some other way.

Who:

I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's
familiar with the code base takes care of cpuset. Michal, can you
please take care of memcg?

4. Make disabled controllers cheaper

Mostly through the use of static_keys, I suppose. Making this
easier AFAICS depends on resolving #2. The lock dependency loop
from #2 makes using static_keys from cgroup callbacks extremely
nasty.

Solution:

Fix #2 and support common pattern from cgroup core.

Who:

Dunno. Let's see.

5. I CAN HAZ HIERARCHIES?

The cpu ones handle nesting correctly - parent's accounting includes
children's, parent's configuration affects children's unless
explicitly overridden, and children's limits nest inside parent's.

memcg asked itself the existential question of to be hierarchical or
not and then got confused and decided to become both.

When faced with the same question, blkio and cgroup_freezer just
gave up and decided to allow nesting and then ignore it - brilliant.

And there are others which kinda sorta try to handle hierarchy but
only goes way-half.

This one is screwed up embarrassingly badly. We failed to establish
one of the most basic semantics and can't even define what a cgroup
hierarchy is - it depends on each controller and they're mostly
wacky!

Fortunately, I don't think it will be prohibitively difficult to dig
ourselves out of this hole.

Solution:

* cpu ones seem fine.

* For broken controllers, cgroup core will be generating warning
messages if the user tries to nest cgroups so that the user at
least can know that the behavior may change underneath them later
on. For more details,

http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902

* memcg can be fully hierarchical but we need to phase out the flat
hierarchy support. Unfortunately, this involves flipping the
behavior for the existing users. Upstream will try to nudge users
with warning messages. Most burden would be on the distros and at
least SUSE seems to be on board with it. Needs coordination with
other distros.

* blkio is the most problematic. It has two sub-controllers - cfq
and blk-throttle. Both are utterly broken in terms of hierarchy
support and the former is known to have pretty hairy code base. I
don't see any other way than just biting the bullet and fixing it.

* cgroup_freezer and others shouldn't be too difficult to fix.

Who:

memcg can be handled by memcg people and I can handle cgroup_freezer
and others with help from the authors. The problematic one is
blkio. If anyone is interested in working on blkio, please be my
guest. Vivek? Glauber?

6. Multiple hierarchies

Apart from the apparent wheeeeeeeeness of it (I think I talked about
that enough the last time[1]), there's a basic problem when more
than one controllers interact - it's impossible to define a resource
group when more than two controllers are involved because the
intersection of different controllers is only defined in terms of
tasks.

IOW, if an entity X is of interest to two controllers, there's no
way to map X to the cgroups of the two controllers. X may belong to
A and B when viewed by one task but A' and B when viewed by another.
This already is a head scratcher in writeback where blkcg and memcg
have to interact.

While I am pushing for unified hierarchy, I think it's necessary to
have different levels of granularities depending on controllers
given that nesting involves significant overhead and noticeable
controller-dependent behavior changes.

Solution:

I think a unified hierarchy with the ability to ignore subtrees
depending on controllers should work. For example, let's assume the
following hierarchy.

R
/ \
A B
/ \
AA AB

All controllers are co-mounted. There is per-cgroup knob which
controls which controllers nest beyond it. If blkio doesn't want to
distinguish AA and AB, the user can specify that blkio doesn't nest
beyond A and blkio would see the tree as,

R
/ \
A B

While other controllers keep seeing the original tree. The exact
form of interface, I don't know yet. It could be a single file
which the user echoes [-]controller name into it or per-controller
boolean file.

I think this level of flexibility should be enough for most use
cases. If someone disagrees, please voice your objections now.

I *think* this can be achieved by changing where css_set is bound.
Currently, a css_set is (conceptually) owned by a task. After the
change, a cgroup in the unified hierarchy has its own css_set which
tasks point to and can also be used to tag resources as necessary.
This way, it should be achieveable without introducing a lot of new
code or affecting individual controllers too much.

The headache will be the transition period where we'll probably have
to support both modes of operation. Oh well....

Who:

Li, Glauber and me, I guess?

7. Misc issues

* Sort & unique when listing tasks. Even the documentation says it
doesn't happen but we have a good hunk of code doing it in
cgroup.c. I'm gonna rip it out at some point. Again, if you
don't like it, scream.

* At the PLC, pjt told me that assinging threads of a cgroup to
different cgroups is useful for some use cases but if we're to
have a unified hierarchy, I don't think we can continue to do
that. Paul, can you please elaborate the use case?

* Vivek brought up the issue of distributing resources to tasks and
groups in the same cgroup. I don't know. Need to think more
about it.

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/857
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: David Ahern: "[PATCH 0/3 v3] perf: precise mode and exclude_guest"
Previous message: Rafael Aquini: "Re: [PATCH -mm] enable CONFIG_COMPACTION by default"
Next in thread: Mike Galbraith: "Re: [RFC] cgroup TODOs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]