Re: Question about cacheline bouncing with percpu-rwsem and rcu-sync

From: Paul E. McKenney
Date: Sun Jun 09 2019 - 08:27:03 EST


On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> [snip]
> > >
> > > Either way, it would be good for you to just try it. Create a kernel
> > > module or similar that hammers on percpu_down_read() and percpu_up_read(),
> > > and empirically check the scalability on a largish system. Then compare
> > > this to down_read() and up_read().
> >
> > Will do! thanks.
>
> I created a test for this, and the results are quite amazing. It just
> stresses read lock/unlock for rwsem vs percpu-rwsem.
> The test is conducted on a dual-socket Intel x86_64 machine with 14
> cores per socket.
>
> Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424

Interesting location, but looks functional. ;-)

> Graphs/Results here:
> https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
>
> The completion time of the test goes up somewhat exponentially with
> the number of threads for the rwsem case, whereas for percpu-rwsem
> it is the same. I could add this data to some of the documentation as
> well.

Actually, the completion time looks to be pretty close to linear in the
number of CPUs. Which is still really bad, don't get me wrong.

Thank you for doing this, and it might be good to have some documentation
on this. In perfbook, I use counters to make this point, and perhaps
I need to emphasize more that it also applies to other algorithms,
including locking. Me, I learned this lesson from a logic analyzer
back in the very early 1990s. This was back in the days before on-CPU
caches when a logic analyzer could actually tell you something about
the detailed execution. ;-)

The key point is that you can often closely approximate the performance
of synchronization algorithms by counting the number of cache misses and
the number of CPUs competing for each cache line.
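
To make that concrete, here is a minimal userspace sketch of the
counter comparison. It is not the kernel test linked above. The names
run_counters(), worker(), struct padded_counter, and the 64-byte
CACHE_LINE value are all made up for illustration. The shared case has
every thread doing an atomic increment on one cache line, which is
roughly what rwsem readers do to the lock word. The per-thread case
gives each thread its own padded counter, which is roughly the
percpu-rwsem read-side fast path.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define CACHE_LINE 64	/* assumed cache-line size, typical for x86_64 */

/* One counter per thread, aligned so that no two share a cache line. */
struct padded_counter {
	_Alignas(CACHE_LINE) unsigned long count;
};

struct worker_arg {
	atomic_ulong *shared;		/* non-NULL selects the contended case */
	struct padded_counter *local;	/* this thread's private counter */
	unsigned long loops;
};

static void *worker(void *p)
{
	struct worker_arg *a = p;
	/* volatile keeps the compiler from collapsing the loop into one add */
	volatile unsigned long *mine = &a->local->count;

	for (unsigned long i = 0; i < a->loops; i++) {
		if (a->shared)	/* every CPU fights over a single cache line */
			atomic_fetch_add_explicit(a->shared, 1,
						  memory_order_relaxed);
		else		/* each CPU stays within its own cache line */
			(*mine)++;
	}
	return NULL;
}

/* Run nthreads workers for loops iterations each; return the total count. */
static unsigned long run_counters(int nthreads, unsigned long loops,
				  int use_shared)
{
	atomic_ulong shared = 0;
	pthread_t *tids = malloc(nthreads * sizeof(*tids));
	struct worker_arg *args = malloc(nthreads * sizeof(*args));
	struct padded_counter *locals =
		aligned_alloc(CACHE_LINE, nthreads * sizeof(*locals));
	unsigned long total = 0;

	if (!tids || !args || !locals)
		abort();

	for (int i = 0; i < nthreads; i++) {
		locals[i].count = 0;
		args[i].shared = use_shared ? &shared : NULL;
		args[i].local = &locals[i];
		args[i].loops = loops;
		if (pthread_create(&tids[i], NULL, worker, &args[i]))
			abort();
	}
	for (int i = 0; i < nthreads; i++) {
		pthread_join(tids[i], NULL);
		total += locals[i].count;
	}
	total += atomic_load(&shared);

	free(locals);
	free(args);
	free(tids);
	return total;
}
```

Build with -pthread, call run_counters(n, loops, 1) and
run_counters(n, loops, 0) from a main() for increasing n, and time each
call with clock_gettime(). Both variants return n * loops. Only the
number of CPUs competing for each cache line differs, so any gap in
completion time comes from that.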

If you want to get the microbenchmark test code itself upstream,
one approach might be to have a kernel/locking/lockperf.c similar to
kernel/rcu/rcuperf.c.

Thoughts?

Thanx, Paul