Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

From: Johannes Weiner
Date: Fri Feb 14 2020 - 11:53:20 EST


On Fri, Feb 14, 2020 at 04:13:18PM +0100, Michal Hocko wrote:
> On Fri 14-02-20 08:57:28, Tejun Heo wrote:
> [...]
>
> Sorry to skip over a large part of your response. The discussion in this
> thread got quite fragmented already and I would really like to conclude
> to something.
>
> > > I believe I have already expressed the configurability concern elsewhere
> > > in the email thread. It boils down to necessity to propagate
> > > protection all the way up the hierarchy properly if you really need to
> > > protect leaf cgroups that are organized without a resource control in
> > > mind. Which is what systemd does.
> >
> > But that doesn't work for other controllers at all. I'm having a
> > difficult time imagining how making this one control mechanism work
> > that way makes sense. Memory protection has to be configured together
> > with IO protection to be actually effective.
>
> Please be more specific. If the protected workload is mostly in-memory,
> I do not really see how IO controller is relevant. See the example of
> the DB setup I've mentioned elsewhere.

Say you have the following tree:

root
/ \
system.slice user.slice
/ | |
cron journal user-100.slice
|
session-c2.scope
| \
the_workload misc

You're saying you want to prioritize the workload above everything
else in the system: system tasks, other users' stuff, even other stuff
running as your own user. Therefor you configure memory.low on the
workload and propagate the value up the tree. So far so good, memory
is protected.

However, what you've really done just now is flattened the resource
hierarchy. You configured the_workload not just more important than
its sibling "misc", but you actually pulled it up the resource tree
and declared it more important than what's running in other sessions,
what users are running, and even the system software. Your cgroup tree
still reflects process ownership, but it doesn't actually reflect the
resource hierarchy you just configured.

And that is something that other controllers do not support: you
cannot give IO weight to the_workload without also making the c2
session more important than other sessions, user 100 more important
than other users, and user.slice more important than system.slice.

Memory is really the only controller in which this kind of works.

And yes, maybe you're talking about a highly specific case where
the_workload is mostly in memory and also doesn't need any IO capacity
and you have CPU to spare, so this is good enough for you and you
don't care about the other controllers in that particular usecase.

But the discussion is about the broader approach to resource control:
if other controllers already require the cgroup hierarchy to reflect
resource hierarchy, memory shouldn't be different just because.

And in practice, most workloads *do* in fact need comprehensive
control. Which workload exclusively needs memory, but no IO, and no
CPU? If you only control for memory, lower priority workloads tend to
eat up your CPU doing reclaim and your IO doing paging.

So when talking about the design and semantics of one controller, you
have to think about this comprehensively.

The proper solution to implement the kind of resource hierarchy you
want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
the_workload might have been started by user 100 in session c2, but in
terms of resources, it's prioritized over system.slice and user.slice,
and so that's the level where it needs to sit:

root
/ | \
system.slice user.slice the_workload
/ | |
cron journal user-100.slice
|
session-c2.scope
|
misc

Then you can configure not just memory.low, but also a proper io
weight and a cpu weight. And the tree correctly reflects where the
workload is in the pecking order of who gets access to resources.

> > As for cgroup hierarchy being unrelated to how controllers behave, it
> > frankly reminds me of cgroup1 memcg flat hierarchy thing I'm not sure
> > how that would actually work in terms of resource isolation. Also, I'm
> > not sure how systemd forces such configurations and I'd think systemd
> > folks would be happy to fix them if there are such problems. Is the
> > point you're trying to make "because of systemd, we have to contort
> > how memory controller behaves"?
>
> No, I am just saying and as explained in reply to Johannes, there are
> practical cases where the cgroup hierarchy reflects organizational
> structure as well.

There is nothing wrong with that, and systemd does it per default. But
it also doesn't enable resource control domains per default.

Once you do split up resource domains, you cannot express both
hierarchies in a single tree. And cgroup2 controllers are designed
such that resource domain hierarchy takes precedence.

It's perfectly fine to launch the workload in a different/new resource
domain. This doesn't conflict with systemd, and is in fact supported
by it. See the Slice= attribute (systemd.resource_control(5)).

Sure you need to be privileged to do this, but you also need to be
privileged to set memory.low on user.slice...