Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

From: Tejun Heo
Date: Sat Aug 22 2015 - 14:29:57 EST


Hello, Paul.

On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
...
> A very concrete example of the above is a virtual machine in which you
> want to guarantee scheduling for the vCPU threads which must schedule
> beside many hypervisor support threads. A hierarchy is the only way
> to fix the ratio at which these compete.

Just to learn more, what sort of hypervisor support threads are we
talking about? They would have to consume a considerable amount of CPU
cycles for problems like this to be relevant, and be dynamic in number
in a way which makes letting them compete against vcpus sensible. Do
IO helpers meet these criteria?
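
As a rough illustration of the ratio argument in the quoted paragraph,
here is a minimal sketch of proportional-share arithmetic. It assumes an
idealized weight-proportional scheduler in which every thread is always
runnable. The function names and the weights are hypothetical and are
not kernel interfaces.

```python
def flat_share(vcpu_threads, support_threads, weight=1024):
    """Fraction of CPU the vCPU threads get when every thread competes
    at the same level with the same weight (idealized assumption)."""
    total = (vcpu_threads + support_threads) * weight
    return vcpu_threads * weight / total


def grouped_share(vcpu_group_weight, support_group_weight):
    """Fraction of CPU the vCPU group gets when the two kinds of threads
    sit in sibling groups; thread counts inside each group do not matter."""
    return vcpu_group_weight / (vcpu_group_weight + support_group_weight)


if __name__ == "__main__":
    # Hypothetical numbers: 4 vCPU threads, a varying number of support threads.
    for support in (2, 8, 32):
        print(support, round(flat_share(4, support), 3),
              grouped_share(3072, 1024))
```

In this toy model the flat share of the vCPU threads shrinks as support
threads are added, while the grouped share depends only on the two group
weights.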

> An example that's not the cpu controller is that we use cpusets to
> expose to applications their "shared" and "private" cores. (These
> sets are dynamic based on what is coscheduled on a given machine.)

Can you please go into more detail on these?

> > Why would you assume that threads of a process wouldn't want to
> > configure it ever? How is this different from CPU affinity?
>
> In general cache and CPU behave differently. Generally for it to make
> sense between threads in a process they would have to have wholly
> disjoint memory, at which point the only sane long-term implementation
> is separate processes and the management moves up a level anyway.
>
> That said, there are surely cases in which it might be convenient to
> use at a per-thread level to correct a specific performance anomaly.
> But at that point, you have certainly reached the level of hammer that
> you can coordinate with an external daemon if necessary.

So, I'm not super familiar with all the use cases but the whole cache
allocation thing is almost by nature a specific niche thing and I feel
pretty reluctant to blow off per-thread usages as too niche to worry
about.

> > I don't follow what you're trying to say with the above paragraph.
> > Are you still talking about CAT? If so, that use case isn't the only
> > one. I'm pretty sure there are people who would want to configure
> > cache allocation at thread level.
>
> I'm not agreeing with you that "in cgroups" means "must be usable by
> applications within that hierarchy". A cgroup subsystem used as a
> partitioning API only by system management daemons is entirely
> reasonable. CAT is a reasonable example of this.

I see. The same argument. I don't think CAT being just a system
management thing makes sense.

> > So, this is a trade-off we're consciously making. If there are
> > common-enough use cases which require jumping across different cgroup
> > domains, we'll try to figure out a way to accommodate those but by
> > default migration is a very cold and expensive path.
>
> The core here was the need for allowing sub-process migration. I'm
> not sure I follow the performance trade-off argument; haven't we
> historically seen the opposite? That migration has been a slow-path
> without optimizations and people pushing to make it faster? This
> seems a hard generalization to make for something that's inherently
> tied to a particular controller.

It isn't something tied to a particular controller. Some controllers
may get impacted less than others, but there's an inherent connection
between how dynamic an association is and how expensive the locking
around it needs to be. We need to set up basic behavior and usage
conventions so that different controllers are designed and implemented
assuming similar usage patterns; otherwise, we end up with the chaotic
shit show that we have had, where everything behaves differently,
nobody knows what's the right way to do things, and we end up locked
into weird requirements which some controller induced for no good
reason but which cause significant pain for use cases which actually
matter.

> I don't care if we try turning that dial back to assume it's a cold
> path once more, only that it's supported.

It has always been a cold path and I'm not saying this is gonna be
noticeably worse in the future, but usages like bouncing threads on a
request-by-request basis are and will be clearly worse than bouncing
requests to threads which are already in the target domain.
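
The second pattern, handing requests to threads that already live in the
target domain, can be sketched roughly as below. This is only an
illustration under stated assumptions: the "domain" is a plain label
rather than a real cgroup, and start_worker, dispatch and run are
hypothetical helper names.

```python
import queue
import threading


def start_worker(domain, results):
    """Start one long-lived worker standing in for a thread that was
    placed in `domain` once, at setup time. Returns (queue, thread)."""
    q = queue.Queue()

    def loop():
        while True:
            req = q.get()
            if req is None:  # sentinel: shut down
                break
            results.append((domain, req))

    t = threading.Thread(target=loop)
    t.start()
    return q, t


def dispatch(requests, workers):
    """Hand each (domain, payload) request to the worker already living
    in that domain instead of moving the calling thread."""
    for domain, payload in requests:
        workers[domain][0].put(payload)


def run(requests, domains=("interactive", "batch")):
    results = []
    workers = {d: start_worker(d, results) for d in domains}
    dispatch(requests, workers)
    for q, t in workers.values():
        q.put(None)
        t.join()
    return results
```

The placement cost is paid once per worker; each request then costs a
queue operation rather than a migration.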

> >> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another
> >> > way to address some of these issues.
> >
> > That sounds horrible to me. What if the process wants to do RMW a
> > config?
>
> Locking within a process is easy.

It's not contained in the process at all. What if an external entity
decides to migrate the process into another cgroup in between?
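
A toy model of that race, as a minimal sketch: FakeCgroupFS is an
in-memory stand-in invented for illustration (not a real kernel or
library API), and the file name and values are made up. The accessor
resolves the caller's current cgroup on every access, as a forwarding
accessor would.

```python
class FakeCgroupFS:
    """Toy in-memory stand-in for cgroupfs; purely hypothetical."""

    def __init__(self):
        self.files = {"/A/cpu.weight": 100, "/B/cpu.weight": 500}
        self.membership = {"proc": "/A"}

    def resolve_self(self, who):
        # A forwarding accessor resolves to the *current* cgroup each time.
        return self.membership[who]

    def read(self, who, name):
        return self.files[f"{self.resolve_self(who)}/{name}"]

    def write(self, who, name, value):
        self.files[f"{self.resolve_self(who)}/{name}"] = value


def rmw_with_migration(fs):
    """Read-modify-write with an external migration in between."""
    old = fs.read("proc", "cpu.weight")      # read resolves to /A
    fs.membership["proc"] = "/B"             # external agent migrates us
    fs.write("proc", "cpu.weight", old * 2)  # write resolves to /B
    return fs.files
```

In-process locking cannot prevent this: the value read from one cgroup
ends up written, modified, into a different one.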

> > What if the permissions are different after an intervening
> > migration?
>
> This is a side-effect of migration not being properly supported.
>
> > What if the sub-hierarchy no longer exists or has been
> > replaced by a hierarchy with the same topology but actually is a
> > different one?
>
> The easy answer is that: Only a process should be managing its
> sub-hierarchy. That's the nice thing about hierarchies.

cgroupfs is a horrible place to implement that part of the interface.
It doesn't make any sense to combine those two into the same hierarchy.
You're agreeing with the identified problem but somehow still
suggesting doing what we've been doing when the root cause of said
problem is conflating and interlocking these two separate things.

> The harder answer is: How do we handle non-fungible resources such as
> CPU assignments within a hierarchy? This is a big part of why I make
> arguments for certain partitions being management-software only above.
> This is imperfect, but better than where we stand today.

I'm not following. Why is that different?

> > Let's build an API which actually looks and behaves like an API which
> > is properly isolated from what external agents may do to the process.
> > I can't see how that would be "back to where we are today". All of
> > those are pretty critical attributes for a public kernel API and
> > utterly broken right now.
>
> Sure, but I don't think you can throw out per-thread control for all
> controllers to enable this. Which makes everything else harder. An
> intermediary step in unification might be that we move from N mounts
> to 2. Those that can be managed at the process level, and those that
> can't. It's a compromise, but may allow cleaner abstractions for the
> former case.

The transition can already be gradual. Why would you add yet another
transition step?

Thanks.

--
tejun