Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues

From: Eric Dumazet
Date: Tue Feb 28 2023 - 10:07:19 EST


On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.

We have per-cpu or per-tcp-socket dst though.

Input path is RCU and does not touch dst refcnt.

In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.

So it would be nice knowing in which case you noticed some issues,
maybe there is something wrong in some layer.

>
> This happens when there are a large amount of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwith limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
> 1) False sharing
>
> dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
> it into a seperate cacheline vs. the read mostly members located at
> the beginning of the struct.
>
> That prevents false sharing vs. the struct members in the first 64
> bytes of the structure, but there is also
>
> dst_entry::lwtstate
>
> which is located after the reference count and in the same cache
> line. This member is read after a reference count has been acquired.
>
> The other problem is struct rtable, which embeds a struct dst_entry
> at offset 0. struct dst_entry has a size of 112 bytes, which means
> that the struct members of rtable which follow the dst member share
> the same cache line as dst_entry::__refcnt. Especially
>
> rtable::rt_genid
>
> is also read by the contexts which have a reference count acquired
> already.
>
> When dst_entry:__refcnt is incremented or decremented via an atomic
> operation these read accesses stall and contribute to the performance
> problem.

In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.

>
> 2) atomic_inc_not_zero()
>
> A reference on dst_entry:__refcnt is acquired via
> atomic_inc_not_zero() and released via atomic_dec_return().
>
> atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
> which exposes O(N^2) behaviour under contention with N concurrent
> operations.
>
> Lightweight instrumentation exposed an average of 8!! retry loops per
> atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> running concurrently on 112 CPUs.

User space benchmark <> kernel space.
And we tend not using 112 cpus for kernel stack processing.

Again, concurrent dst->refcnt changes are quite unlikely.

>
> There is nothing which can be done to make atomic_inc_not_zero() more
> scalable.
>
> The following series addresses these issues:
>
> 1) Reorder and pad struct dst_entry to prevent the false sharing.
>
> 2) Implement and use a reference count implementation which avoids the
> atomic_inc_not_zero() problem.
>
> It is slightly less performant in the case of the final 1 -> 0
> transition, but the deconstruction of these objects is a low
> frequency event. get()/put() pairs are in the hotpath and that's
> what this implementation optimizes for.
>
> The algorithm of this reference count is only suitable for RCU
> managed objects. Therefore it cannot replace the refcount_t
> algorithm, which is also based on atomic_inc_not_zero(), due to a
> subtle race condition related to the 1 -> 0 transition and the final
> verdict to mark the reference count dead. See details in patch 2/3.
>
> It might be just my lack of imagination which declares this to be
> impossible and I'd be happy to be proven wrong.
>
> As a bonus the new rcuref implementation provides underflow/overflow
> detection and mitigation while being performance wise on par with
> open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
> the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!

Can you elaborate on what networking benchmark you have used,
and what actual gains you got ?

In which path access to dst->lwtstate proved to be a problem ?

>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?

Initial feeling is that we need more details on the real workload.

Is it some degenerated case with one connected UDP socket used by
multiple threads ?

To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place.
Not the other way (noticing there is an issue, enquire networking
folks about it, before designing a solution)

Thanks



>
> Thanks,
>
> tglx
> ---
> include/linux/rcuref.h | 89 +++++++++++
> include/linux/types.h | 6
> include/net/dst.h | 21 ++
> include/net/sock.h | 2
> lib/Makefile | 2
> lib/rcuref.c | 311 ++++++++++++++++++++++++++++++++++++++++
> net/bridge/br_nf_core.c | 2
> net/core/dst.c | 26 ---
> net/core/rtnetlink.c | 2
> net/ipv6/route.c | 6
> net/netfilter/ipvs/ip_vs_xmit.c | 4
> 11 files changed, 436 insertions(+), 35 deletions(-)