Re: [PATCH] lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll

From: Mateusz Guzik
Date: Mon Jan 23 2023 - 10:59:20 EST


On 1/23/23, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
> On Thu, Jan 19, 2023 at 1:47 PM David Laight <David.Laight@xxxxxxxxxx>
> wrote:
>>
>> > BTW: Recently, it was determined [1] that the usage of cpu_relax()
>> > inside the cmpxchg loop can be harmful for performance. We actually
>> > have the same situation here, so perhaps cpu_relax() should be removed
>> > in the same way it was removed from the lockref.
>>
>> I'm not sure you can ever want a cpu_relax() in a loop that
>> is implementing an atomic operation.
>> Even the ia64 (die...) issue was with a loop that was waiting
>> for another cpu to change the location (eg a spinlock).
>>
>> For an atomic operation an immediate retry is likely to succeed.
>> Any kind of deferral to an another cpu can only make it worse.
>>
>> Clearly if you have 100s of cpu looping doing atomic operation
>> on the same cache line it is likely that some get starved.
>> But to fix that you need to increase the time between successful
>> operations, not delay on failure.
>
> I would like to point out that the wikipedia article on
> compare-and-swap claims [1] that:
>
> Instead of immediately retrying after a CAS operation fails,
> researchers have found that total system performance can be improved
> in multiprocessor systems—where many threads constantly update some
> particular shared variable—if threads that see their CAS fail use
> exponential backoff—in other words, wait a little before retrying the
> CAS [2].
>
> [1] https://en.wikipedia.org/wiki/Compare-and-swap#Overview
> [2] https://arxiv.org/pdf/1305.5800.pdf
>

I skimmed the paper and I think I got the gist of it, which boils down
to a very common idea concerning damage control of multiple CPUs
modifying the same thing.

The idea boils down to making some of the contending CPUs bugger off
for a time. How to specifically do it is another matter. Key problem
with it is that CPUs which already failed to make any progress now
artificially delay the entire thing even more.

Fundamental observation is that 2 or more CPUs rolling with an atomic
op on the same cacheline are going to interfere with each other and
that effect is only going to get worse as you keep adding more of them.
If you can make some of them temporarily bugger off, that is a win in
the sense that the ping-pong is reduced. If this is finely tuned, you
may get an overall win with by achieving better throughput while
maintaining acceptable worst case latency. Otherwise you are starving
other participants which is not acceptable. tl;dr you are walking a very
right rope here and I'm not sure it is worth it.

For a toy example one can imagine 2 CPUs running code to do an op n
times. If they both do it in a tight loop there is plenty of ping-pong,
but if one is simply made to wait until the other one is finished with
all the iterations, you register much shorter total real time. But then
again everything happening on the other CPU had to wait so much longer.

I don't know why but many people love to throw away the value returned
by CAS and instead do another plain access. While I don't see actual
code used, the description in the paper matches this problematic
behavior and I guarantee it artificially makes a tight CAS loop appear
worse than it is.

My own numbers from the cpu_relax removal commit show that throughput
flattens very early, and all the extra computing power thrown at it
simply goes to waste.

All that said. Is a plain cpu_relax in the loop good? Definitely not on
x86-64. Is something more elaborate going to be good? Maybe, highly
depends on CPUs at hand and the workload, and if you mess it up some of
them get starved. This in turn can degrade performance elsewhere if the
starved thread is holding a lock waited on by someone else, and so on.
I would definitely not play any games of the sort in the code at hand.

In that case what about lockref? It definitely is not the fastest it
can be, for one some provisions could be made for NUMA-aware handling.
Basic idea would be to have competing CPUs aggregate the total change
on the lock and then have one of them make it happen across the
interconnect -- so instead of add/sub 1 you would add/sub n, where n
can very well be quite big.

Personally, for lockref, I think the way to go forward is to use it less
to begin with and to look for ways to convert it into a lock xadd-able
primitive instead. The "doing it less thing" could be done by
implementing a RCU-only mode for select syscalls, which defaults to
opportunistically avoid refing squat. If the caller manages to do what
it needed to do, that is a massive win; otherwise refs get taken. This
could work well in the common case for syscalls like statx and access,
but would not for open. Frankly I'm surprised something like this is
not already implemented (maybe I missed a major showstopper?).

--
Mateusz Guzik <mjguzik gmail.com>