Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues

From: Eric Dumazet
Date: Tue Feb 28 2023 - 12:00:14 EST


On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Eric!
>
> On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> > On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >>
> >> Hi!
> >>
> >> Wangyang and Arjan reported a bottleneck in the networking code related to
> >> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> >> a dst_entry increases.
> >
> > We have per-cpu or per-tcp-socket dst though.
> >
> > Input path is RCU and does not touch dst refcnt.
> >
> > In real workloads (200Gbit NIC and above), we do not observe
> > contention on a dst refcnt.
> >
> > So it would be nice knowing in which case you noticed some issues,
> > maybe there is something wrong in some layer.
>
> Two lines further down I explained which benchmark was used, no?
>
> >> This happens when there are a large amount of connections to or from the
> >> same IP address. The memtier benchmark when run on the same host as
> >> memcached amplifies this massively. But even over real network connections
> >> this issue can be observed at an obviously smaller scale (due to the
> >> network bandwith limitations in my setup, i.e. 1Gb).
> >> atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
> >> which exposes O(N^2) behaviour under contention with N concurrent
> >> operations.
> >>
> >> Lightweight instrumentation exposed an average of 8!! retry loops per
> >> atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> >> running concurrently on 112 CPUs.
> >
> > User space benchmark <> kernel space.
>
> I know that. The point was to illustrate the non-scalability.
>
> > And we tend not using 112 cpus for kernel stack processing.
> >
> > Again, concurrent dst->refcnt changes are quite unlikely.
>
> So unlikely that they stand out in that particular benchmark.
>
> >> The overall gain of both changes for localhost memtier ranges from 1.2X to
> >> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
> >>
> >> A micro benchmark which enforces maximized concurrency shows a gain between
> >> 1.2X and 4.7X!!!
> >
> > Can you elaborate on what networking benchmark you have used,
> > and what actual gains you got ?
>
> I'm happy to repeat here that it was memtier/memcached as I explained
> more than once in the cover letter.
>
> > In which path access to dst->lwtstate proved to be a problem ?
>
> ip_finish_output2()
> if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

This change alone should be easy to measure, please do this ?

Oftentimes, moving a field looks sane, but the cache line access is
simply done later.
For example when refcnt is changed :)

Making dsts one cache line bigger has a performance impact.

>
> > To me, this looks like someone wanted to push a new piece of infra
> > (include/linux/rcuref.h)
> > and decided that dst->refcnt would be a perfect place.
> >
> > Not the other way (noticing there is an issue, enquire networking
> > folks about it, before designing a solution)
>
> We looked at this because the reference count operations stood out in
> perf top and we analyzed it down to the false sharing _and_ the
> non-scalability of atomic_inc_not_zero().
>

Please share your recipe and perf results.

We must have been very lucky to not see this at Google.

tcp_rr workloads show dst_mtu() costs (60k GCU in Google fleet) ,
which outperform the dst refcnt you are mentioning here.