Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups

From: Roman Gushchin
Date: Mon Feb 03 2020 - 17:39:11 EST


On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > >
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > >
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > >
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > >
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> >
> > Well, arguably it can be done, but there are few drawbacks:
> >
> > 1) On the release path you'll need to make some extra work even for
> > root allocations: calculate the offset only to find the NULL objcg pointer.
> >
> > 2) There will be a memory overhead for root allocations
> > (which might or might not be compensated by the increase
> > of the slab utilization).
>
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.
>
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > except that obj_cgroups vector is created for the whole percpu block.
> > There will be root- and memcg-blocks, and it will be expensive to merge them.
> > I kinda like using the same scheme here and there.
>
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
>
> > Upsides?
> >
> > 1) slab utilization might increase a little bit (but I doubt it will have
> > a huge effect, because both merging sets should be relatively big and well
> > utilized)
>
> Right.
>
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > but there isn't so much code left anyway.
>
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.
>
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
>
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
>
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
>
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

Having two sets of kmem_caches has nothing to do with the refcounting
and obj_cgroup abstraction.
Please, take a look at the final code.

Thanks!