Re: [PATCH v6 00/19] The new cgroup slab memory controller

From: Roman Gushchin
Date: Thu Jun 18 2020 - 21:34:26 EST


On Thu, Jun 18, 2020 at 10:43:44AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 17 Jun 2020 18:29:28 -0700
> Roman Gushchin <guro@xxxxxx> wrote:
>
> > On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> > > On 6/17/20 5:32 AM, Roman Gushchin wrote:
> > > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:
> > > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@xxxxxx> wrote:
> > > >> >
> > > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> > > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@xxxxxx> wrote:
> > > >> > > >
> > > >> [...]
> > > >> > >
> > > >> > > Have you performed any [perf] testing on SLAB with this patchset?
> > > >> >
> > > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > > >> > significant difference. I've checked that it compiles, boots and passes
> > > >> > kselftests. And the memory savings are there.
> > > >> >
> > > >>
> > > >> What about performance? Also, you mentioned that sharing a kmem_cache
> > > >> between accounted and non-accounted allocations can add overhead. Is there
> > > >> any difference between SLAB and SLUB in such a case?
> > > >
> > > > Not really.
> > > >
> > > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > > allocations, which is something I've tried hard to avoid in my original version.
> > > > But I have to admit, it allows us to simplify and remove a lot of code, and here
> > > > it's hard to argue with Johannes, who pushed for this design.
> > > >
> > > > With performance testing it's not that easy, because it's not obvious what
> > > > we want to test. Obviously, per-object accounting is more expensive, and
> > > > measuring something like 1000000 allocations and deallocations in a row from
> > > > a single kmem_cache will show a regression. But in the real world the relative
> > > > cost of allocations is usually low, and we can get some benefit from a smaller
> > > > working set and from having shared kmem_cache objects cache-hot.
> > > > Not to mention the extra free memory and the reduced fragmentation.
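
To make the worst case concrete: the pattern I have in mind is a tight
alloc/free loop over a single accounted cache, along the lines of the sketch
below. The cache and iteration count are made up; it's not a real benchmark,
just an illustration of where the per-object accounting cost is paid
(GFP_KERNEL_ACCOUNT makes every allocation memcg-accounted):

	#include <linux/slab.h>

	/* Sketch: 1000000 back-to-back accounted allocations and frees from a
	 * single kmem_cache. With per-object accounting each iteration goes
	 * through the obj_cgroup charging path, with none of the real-world
	 * benefits (smaller working set, cache-hot shared objects, less
	 * fragmentation) showing up in the numbers. */
	static struct kmem_cache *bench_cache;	/* hypothetical 256-byte cache */

	static void bench_accounted_alloc_free(void)
	{
		unsigned long i;
		void *obj;

		for (i = 0; i < 1000000; i++) {
			obj = kmem_cache_alloc(bench_cache, GFP_KERNEL_ACCOUNT);
			if (!obj)
				break;
			kmem_cache_free(bench_cache, obj);
		}
	}
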
> > > >
> > > > We've done extensive testing of the original version in Facebook production,
> > > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > > using the original version with two sets of kmem_caches.
> > > >
> > > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > > can help with the performance evaluation, I'd appreciate it a lot.
> > >
> > > Jesper provided some pointers here [1]; it would be really great if you could
> > > run at least those microbenchmarks. With mmtests the main question is which
> > > subset/profiles to run; maybe the referenced commits provide some hints,
> > > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/
> >
> > Oh, Jesper, I'm really sorry, somehow I missed your mail.
> > Thank you, Vlastimil, for pointing at it.
> >
> > I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> >
> > I ran original vs patched with SLUB and SLAB, repeated each test several times
> > and picked the 3 runs that looked most consistent. But it still looks very noisy.
> >
> > I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> > comparing 5.8-rc1 + slab controller v6 against 5.8-rc1 (default config from Fedora 32).
>
> What about running these tests on the server-level hardware that you
> intend to run this on?

I'm going to backport this version to the kernel version we're using internally
and will come up with more numbers soon.

>
> >
> > How should I interpret this data?
>
> First of all, these SLUB+SLAB microbenchmarks use an object size of 256 bytes,
> because the network stack allocates objects of this size for SKBs/sk_buff (the
> used size is 224 bytes, rounded up to 256 due to cache alignment). Checking SLUB:
> each slab uses 2 pages (8192 bytes) and contains 32 objects of size 256 (256*32=8192).
>
> The SLUB allocator has a per-CPU slab which speeds up fast reuse, in this
> case up to 32 objects. For SLUB the "fastpath reuse" test exercises this behaviour,
> and it serves as a baseline for optimal 1-object performance (which my bulk
> API tries to beat; that is possible even for 1 object because we know the
> bulk API cannot be used from IRQ context).
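
Just to make sure I'm reading this right: the "fastpath reuse" case is
essentially alloc + immediate free of a single object, so every allocation is
served from the per-CPU slab? Something like the sketch below (my own
illustration of the pattern, not the actual slab_bulk_test01 code):

	#include <linux/slab.h>

	/* Sketch of the "fastpath reuse" pattern: the object freed on the
	 * previous iteration is handed straight back by the per-CPU slab on
	 * the next one, so this measures the best-case 1-object path. */
	static void fastpath_reuse(struct kmem_cache *cache, unsigned long loops)
	{
		unsigned long i;
		void *obj;

		for (i = 0; i < loops; i++) {
			obj = kmem_cache_alloc(cache, GFP_KERNEL);
			if (!obj)
				break;
			kmem_cache_free(cache, obj);
		}
	}
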
>
> SLUB fastpath: 3 measurements reporting cycles(tsc)
> - SLUB-patched : fastpath reuse: 184 - 177 - 176 cycles(tsc)
> - SLUB-original: fastpath reuse: 178 - 153 - 156 cycles(tsc)
>
> There are some stability concerns, as you mention, but it seems pretty
> consistent that the patched version is slower. If you compile with
> no-PREEMPT you can likely get more stable results (and remove a slight
> overhead in the SLUB fastpath).
>
> The microbenchmark also measures the bulk API, which AFAIK is only used
> by the network stack (and io_uring). I guess you shouldn't focus too much
> on these bulk measurements. When the bulk API crosses the objects-per-slab
> threshold, or is unlucky enough to use two per-CPU slabs, the
> measurements can fluctuate a bit.
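
For my own understanding, the bulk_quick_reuse case then amortizes the
per-call overhead over N objects, roughly like this (again just my sketch of
the pattern being timed, not the benchmark code; the array size is only there
to cover the objects=1..8 cases below):

	#include <linux/kernel.h>
	#include <linux/slab.h>

	/* Sketch of the bulk pattern: allocate N objects with one call and
	 * free them with one call, so the fixed cost is shared by N objects. */
	static void bulk_quick_reuse(struct kmem_cache *cache, size_t n,
				     unsigned long loops)
	{
		void *objs[8];		/* enough for the objects=1..8 cases */
		unsigned long i;

		if (n > ARRAY_SIZE(objs))
			return;

		for (i = 0; i < loops; i++) {
			if (!kmem_cache_alloc_bulk(cache, GFP_KERNEL, n, objs))
				break;
			kmem_cache_free_bulk(cache, n, objs);
		}
	}
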
>
> Your numbers for SLUB bulk-API:
>
> SLUB-patched - bulk-API
> - SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc)
> - SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc)
> - SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc)
> - SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc)
> - SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc)
>
> SLUB-original - bulk-API
> - SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc)
> - SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc)
> - SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc)
> - SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc)
> - SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc)
>
> Maybe it is just noise or instability in the measurements, but it seems that the
> 1-object case is consistently slower in your patched version.
>
> Mail is too long now... I'll take a look at your SLAB results and follow up.


Thank you very much for helping with the analysis!

So does it mean you're looking at the smallest number in each series?
If so, the difference is not that big?

Theoretically speaking it should get worse (especially for non-root allocations),
but if the difference is not big, it should still be a net win, because there is
a big expected gain from memory savings/smaller working set/less fragmentation etc.

The only thing I'm slightly worried about is the effect on root allocations
if we're sharing slab caches between root and non-root allocations. Then again,
anyone who depends that heavily on allocation speed can avoid memcg-based
accounting anyway, and for most users the cost of allocation is negligible.
That's why the patch which merges root and memcg slab caches is put on top
and can be reverted if somebody complains.

Thank you!