Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups

From: Roman Gushchin
Date: Wed Feb 05 2020 - 10:58:59 EST


On Tue, Feb 04, 2020 at 01:41:59PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > > This is a fairly big but mostly red patch, which makes all non-root
> > > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > > creating a separate set for each memory cgroup.
> > > > > > > >
> > > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > > prematurely. They can simply be destroyed together with their
> > > > > > > > root counterparts. This allows us to dramatically simplify the
> > > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > >
> > > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > > explain why we still need two sets of kmem_caches.
> > > > > > >
> > > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > > object, we'd go to the root cache and use the cgroup id to look up
> > > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
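> > > > > > >
> > > > > > > (Roughly, from memory - exact helpers aside, the old lookup was
> > > > > > > something like:)
> > > > > > >
> > > > > > > 	/* old scheme: one clone per cgroup, found by cgroup id */
> > > > > > > 	memcg = mem_cgroup_from_task(current);
> > > > > > > 	arr = rcu_dereference(s->memcg_params.memcg_caches);
> > > > > > > 	cachep = arr->entries[memcg_cache_id(memcg)];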
> > > > > > >
> > > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
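> > > > > > >
> > > > > > > (Hypothetically, something like this - modulo the cmpxchg needed
> > > > > > > against a racing allocation:)
> > > > > > >
> > > > > > > 	if (!page->obj_cgroups) {
> > > > > > > 		unsigned int objects = objs_per_slab_page(s, page);
> > > > > > >
> > > > > > > 		page->obj_cgroups = kzalloc(objects * sizeof(struct obj_cgroup *),
> > > > > > > 					    gfp | __GFP_ACCOUNT);
> > > > > > > 	}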
> > > > > >
> > > > > > Well, arguably it can be done, but there are a few drawbacks:
> > > > > >
> > > > > > 1) On the release path you'll need to do some extra work even for
> > > > > > root allocations: calculate the object's offset only to find a NULL
> > > > > > objcg pointer (see the sketch below).
> > > > > >
> > > > > > 2) There will be memory overhead for root allocations
> > > > > > (which may or may not be compensated by the increase
> > > > > > in slab utilization).
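> > > > > >
> > > > > > To illustrate 1), a rough sketch (helper names are approximate):
> > > > > >
> > > > > > 	struct obj_cgroup **objcgs = page->obj_cgroups;
> > > > > >
> > > > > > 	if (objcgs) {
> > > > > > 		unsigned int off = obj_to_index(s, page, obj);
> > > > > >
> > > > > > 		/* a root object computes its offset just to find
> > > > > > 		 * a NULL pointer here */
> > > > > > 		if (objcgs[off]) {
> > > > > > 			obj_cgroup_uncharge(objcgs[off], obj_full_size(s));
> > > > > > 			obj_cgroup_put(objcgs[off]);
> > > > > > 			objcgs[off] = NULL;
> > > > > > 		}
> > > > > > 	}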
> > > > >
> > > > > Those two are only true if there is a wild mix of root and cgroup
> > > > > allocations inside the same slab, and that doesn't really happen in
> > > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > > systems (like most distro systems now) that cgroup everything.
> > > >
> > > > It's actually a questionable statement: we do skip allocations from certain
> > > > contexts, and we do merge slab caches.
> > > >
> > > > Most likely it's true for certain slab caches and not for others.
> > > > Think of the kmalloc-* caches.
> > >
> > > With merging it's actually really hard to say how sparse or dense the
> > > resulting objcgroup arrays would be. It could change all the time too.
> >
> > So here is some actual data from my dev machine. The first column is the
> > number of pages in the root cache, the second the number of pages in the
> > corresponding memcg cache.
> >
> > ext4_groupinfo_4k 1 0
> > rpc_inode_cache 1 0
> > fuse_request 62 0
> > fuse_inode 1 2732
> > btrfs_delayed_node 1192 0
> > btrfs_ordered_extent 129 0
> > btrfs_extent_map 8686 0
> > btrfs_extent_buffer 2648 0
> > btrfs_inode 12 6739
> > PINGv6 1 11
> > RAWv6 2 5
> > UDPv6 1 34
> > tw_sock_TCPv6 378 3
> > request_sock_TCPv6 24 0
> > TCPv6 46 74
> > mqueue_inode_cache 1 0
> > jbd2_journal_handle 2 0
> > jbd2_journal_head 2 0
> > jbd2_revoke_table_s 1 0
> > ext4_inode_cache 1 3
> > ext4_allocation_context 1 0
> > ext4_io_end 1 0
> > ext4_extent_status 5 0
> > mbcache 1 0
> > dnotify_struct 1 0
> > posix_timers_cache 24 0
> > xfrm_dst_cache 202 0
> > RAW 3 12
> > UDP 2 24
> > tw_sock_TCP 25 0
> > request_sock_TCP 24 0
> > TCP 7 24
> > hugetlbfs_inode_cache 2 0
> > dquot 2 0
> > eventpoll_pwq 1 119
> > dax_cache 1 0
> > request_queue 9 0
> > blkdev_ioc 241 0
> > biovec-max 112 0
> > biovec-128 2 0
> > biovec-64 6 0
> > khugepaged_mm_slot 248 0
> > dmaengine-unmap-256 1 0
> > dmaengine-unmap-128 1 0
> > dmaengine-unmap-16 39 0
> > sock_inode_cache 9 219
> > skbuff_ext_cache 249 0
> > skbuff_fclone_cache 83 0
> > skbuff_head_cache 138 141
> > file_lock_cache 24 0
> > net_namespace 1 5
> > shmem_inode_cache 14 56
> > task_delay_info 23 165
> > taskstats 24 0
> > proc_dir_entry 24 0
> > pde_opener 16 24
> > proc_inode_cache 24 1103
> > bdev_cache 4 20
> > kernfs_node_cache 1405 0
> > mnt_cache 54 0
> > filp 53 460
> > inode_cache 488 2287
> > dentry 367 10576
> > names_cache 24 0
> > ebitmap_node 2 0
> > avc_xperms_data 256 0
> > lsm_file_cache 92 0
> > buffer_head 24 9
> > uts_namespace 1 3
> > vm_area_struct 48 810
> > mm_struct 19 29
> > files_cache 14 26
> > signal_cache 28 143
> > sighand_cache 45 47
> > task_struct 77 430
> > cred_jar 29 424
> > anon_vma_chain 39 492
> > anon_vma 28 467
> > pid 30 369
> > Acpi-Operand 56 0
> > Acpi-Parse 5587 0
> > Acpi-State 4137 0
> > Acpi-Namespace 8 0
> > numa_policy 137 0
> > ftrace_event_field 68 0
> > pool_workqueue 25 0
> > radix_tree_node 1694 7776
> > task_group 21 0
> > vmap_area 477 0
> > kmalloc-rcl-512 473 0
> > kmalloc-rcl-256 605 0
> > kmalloc-rcl-192 43 16
> > kmalloc-rcl-128 1 47
> > kmalloc-rcl-96 3 229
> > kmalloc-rcl-64 6 611
> > kmalloc-8k 48 24
> > kmalloc-4k 372 59
> > kmalloc-2k 132 50
> > kmalloc-1k 251 82
> > kmalloc-512 360 150
> > kmalloc-256 237 0
> > kmalloc-192 298 24
> > kmalloc-128 203 24
> > kmalloc-96 112 24
> > kmalloc-64 796 24
> > kmalloc-32 1188 26
> > kmalloc-16 555 25
> > kmalloc-8 42 24
> > kmem_cache_node 20 0
> > kmem_cache 24 0
>
> That's interesting, thanks. It does look fairly bimodal, except in
> some smaller caches - which makes sense when you think about it: we
> focus on accounting consumers that are driven by userspace activity
> and are big enough to actually matter in terms of cgroup footprint.
>
> > > > Also, because obj_cgroup vectors will not be freed before their underlying
> > > > pages, the percentage of pages with obj_cgroups will most likely grow with
> > > > uptime. In other words, memcg allocations will fragment root slab pages.
> > >
> > > I understand the first part of this paragraph, but not the second. The
> > > objcgroup vectors will be freed when the slab pages get freed. But the
> > > partially filled slab pages can be reused by any type of allocation,
> > > surely? How would this cause the pages to fragment?
> >
> > I mean the following: once you allocate a single accounted object
> > from a page, the obj_cgroup vector is allocated and will be released only
> > with the slab page. We really, really don't want to count how many accounted
> > objects are on the page and release the obj_cgroup vector on reaching 0.
> > So even if all subsequent allocations are root allocations, the overhead
> > will not go away with uptime.
> >
> > In other words, even a small percentage of accounted objects will
> > turn the whole cache into an "accountable" one.
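> >
> > To put a number on it: for a 4k slab page of kmalloc-64 there are 64
> > objects per page, so on a 64-bit machine the obj_cgroup vector costs
> > 64 pointers * 8 bytes = 512 bytes, a permanent ~12.5% per-page overhead,
> > paid even if only one of the 64 objects was ever accounted.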
>
> Correct. The worst case is where we have a large cache that has N
> objects per slab, but only ~1/N objects are accounted to a cgroup.
>
> The question is whether this is common or even realistic. When would a
> cache be big, but only a small subset of its allocations would be
> attributable to specific cgroups?
>
> In the less extreme overlapping cases, yeah, there are fragmented
> obj_cgroup arrays, but there is also better slab packing. One is an
> array of pointers, the other is an array of larger objects. It would
> seem slab fragmentation has the potential to waste much more memory?
>
> > > > > > 3) I'm working on percpu memory accounting that uses the same scheme,
> > > > > > except that the obj_cgroups vector is created for the whole percpu block.
> > > > > > There will be root and memcg blocks, and it will be expensive to merge them.
> > > > > > I kinda like using the same scheme here and there.
> > > > >
> > > > > It's hard to conclude anything based on this information alone. If
> > > > > it's truly expensive to merge them, then it warrants the additional
> > > > > complexity. But I don't understand the desire to share a design for
> > > > > two systems with sufficiently different constraints.
> > > > >
> > > > > > Upsides?
> > > > > >
> > > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > > > a huge effect, because both sets being merged should be relatively big and
> > > > > > well utilized)
> > > > >
> > > > > Right.
> > > > >
> > > > > > 2) eliminate dynamic creation/destruction of memcg kmem_caches. It's nice,
> > > > > > but there isn't much code left for that anyway.
> > > > >
> > > > > There is a lot of complexity associated with the cache cloning that
> > > > > isn't in the lines of code, but in the lifetime and synchronization rules.
> > > >
> > > > Quite the opposite: the patchset removes all the complexity (or 90% of it),
> > > > because it makes the kmem_cache lifetime independent of any cgroup stuff.
> > > >
> > > > Kmem_caches are created on demand on the first request (most likely during
> > > > the system start-up), and destroyed together with their root counterparts
> > > > (most likely never, or on rmmod). First request means the globally first
> > > > request, not the first request from a given memcg.
> > > >
> > > > The memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups;
> > > > after creation it just matches the lifetime of the root kmem_caches.
> > > >
> > > > The only reason to keep the async creation is that some kmem_caches
> > > > are created very early in the boot process, long before any cgroup
> > > > stuff is initialized.
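> > > >
> > > > Schematically (field and helper names are approximate):
> > > >
> > > > 	static struct kmem_cache *memcg_cache(struct kmem_cache *s)
> > > > 	{
> > > > 		struct kmem_cache *c = READ_ONCE(s->memcg_params.memcg_cache);
> > > >
> > > > 		if (unlikely(!c)) {
> > > > 			/* created asynchronously, because the globally
> > > > 			 * first request can come long before cgroups
> > > > 			 * are initialized */
> > > > 			queue_work(system_wq, &s->memcg_params.work);
> > > > 			return s;	/* fall back to the root cache */
> > > > 		}
> > > > 		return c;
> > > > 	}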
> > >
> > > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > > simpler after your patches. But I'm not talking about the delta, I'm
> > > trying to understand the end result.
> > >
> > > And the truth is there is a decent chunk of code and tentacles spread
> > > throughout the slab/cgroup code to clone, destroy, and handle the
> > > split caches, as well as the branches/indirections on every cgrouped
> > > slab allocation.
> >
> > Did you see the final code? It's fairly simple, and there is really not
> > much complexity left. If you don't think so, let's go into the details,
> > because otherwise it's hard to say anything.
>
> I have the patches applied to a local tree and am looking at the final
> code. But I can only repeat that "it's not too bad" simply isn't a
> good explanation for why the code is the way it is.
>
> > With such a change, which basically removes the current implementation
> > and replaces it with a new one, it's hard to strike a balance between
> > making commits small and self-contained and showing the whole picture.
> > I'm fully open to questions and generally want to make it simpler.
> >
> > I've tried to separate some parts and get them merged before the main
> > thing, but they haven't been merged yet, so I have to include them
> > to keep the thing building.
> >
> > Would a more detailed design description in the cover letter help?
> > Would a design doc in Documentation/ help?
> > Would it be better to rearrange the patches so that the current
> > implementation is eliminated first and the new one built from scratch?
>
> It would help to have changelogs that actually describe how the new
> design is supposed to work, and why you made the decisions you made.
>
> The changelog in this patch here sells the change as a reduction in
> complexity, without explaining why it stopped where it stopped. So
> naturally, if that's the declared goal, the first question is whether
> we can make it simpler.
>
> Both the cover letter and the changelogs should focus less on what was
> there and how it was deleted, and more on how the code is supposed to
> work after the patches. How the components were designed and how they
> all work together.
>
> As I said before, imagine somebody without any historical knowledge
> reading the code. They should be able to find out why you chose to
> have two sets of kmem caches. There is no explanation for it other
> than "there used to be more, so we cut it down to two".
>
> > > > > And these two things are the primary aspects that make my head hurt
> > > > > trying to review this patch series.
> > > > >
> > > > > > So IMO it's an interesting direction to explore, but not something
> > > > > > that necessarily has to be done in the context of this patchset.
> > > > >
> > > > > I disagree. Instead of replacing the old coherent model and its
> > > > > complexities with a new coherent one, you are mixing the two. And I
> > > > > can barely understand the end result.
> > > > >
> > > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > > whether the pages have an obj_cgroup array or not is *completely
> > > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > > start, nobody would have ever thought about doing it like this.
> > > >
> > > > It's just not true. The whole point of having root and memcg sets is
> > > > to avoid looking for a NULL pointer in the obj_cgroup vector when
> > > > releasing a root object. In other words, it allows us to keep zero
> > > > overhead for root allocations. IMHO that's an important property, and
> > > > calling it *completely insane* isn't the best way to communicate.
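> > > >
> > > > With separate sets, the free hook for a root cache can bail out on a
> > > > single branch (sketch, hypothetical helper name):
> > > >
> > > > 	if (is_root_cache(s))
> > > > 		return;	/* no offset math, no objcg vector touched */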
> > >
> > > But you're trading it for the indirection of going through a separate
> > > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > > preferable trade-off to make?
> >
> > Because it allows us to keep zero memory and cpu overhead for root
> > allocations. I have no data showing that this overhead would be small
> > and acceptable in all cases. I think keeping zero overhead for root
> > allocations is more important than having a single set of kmem caches.
>
> In the kmem cache breakdown you provided above, there are 35887 pages
> allocated to root caches and 37300 pages allocated to cgroup caches.
>
> Why are root allocations supposed to be more important? Aren't some of
> the hottest allocations tracked by cgroups? Look at fork():
>
> > vm_area_struct 48 810
> > mm_struct 19 29
> > files_cache 14 26
> > signal_cache 28 143
> > sighand_cache 45 47
> > task_struct 77 430
> > cred_jar 29 424
> > anon_vma_chain 39 492
> > anon_vma 28 467
> > pid 30 369
>
> Hard to think of much hotter allocations. They all have to suffer the
> additional branch and cache footprint of the auxiliary cgroup caches.
>
> > > > I agree that it's an arguable question whether we can tolerate some
> > > > additional overhead on root allocations to eliminate these additional
> > > > 10%, but I really don't think it's so obvious that even discussing
> > > > it is insane.
> > >
> > > Well that's exactly my point.
> >
> > Ok, what's an acceptable performance penalty?
> > Is adding 20% to the free path acceptable, for example?
> > Or adding 3% of slab memory?
>
> I find offhand replies like these very jarring.
>
> There is a legitimate design question: why are you using a separate
> set of kmem caches for the cgroup allocations, given the additional
> complexity over having one set? And your reply was mostly handwaving.

Johannes,

I posted patches and numbers that show that the patchset improves
a fundamental kernel characteristic (slab utilization) by a meaningful margin.
This has been confirmed by others, who kindly tested it on their machines.

Surprisingly, during this and previous review sessions I didn't hear
a single good word from you, only a constant stream of blame: I don't answer
questions, I don't write perfect code, I fail to provide satisfying
answers, I'm waving hands, saying insane things, etc.
At the slightest disagreement you basically raise the tone.

I find this style of discussion irritating and non-productive,
so I'm taking a break and starting to work on the next version.