Re: [PATCH-tip v3 02/14] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER

From: Waiman Long
Date: Fri Apr 12 2019 - 22:25:15 EST


On 04/12/2019 02:05 PM, Waiman Long wrote:
> On 04/12/2019 12:41 PM, Ingo Molnar wrote:
>>
>> So beyond the primary constraint of PeterZ OK-ing it all, there's also
>> these two scalability regression reports from the ktest bot:
>>
>> [locking/rwsem] 1b94536f2d: stress-ng.bad-altstack.ops_per_sec -32.7% regression
> A regression due to the lock handoff patch is kind of expected, but I
> will look into why there is such a large drop.

I don't have a high core count system on hand. I ran the stress-ng test
on a 2-socket 40-core 80-thread Skylake system:


Kernels: 1) Before lock handoff patch
         2) After lock handoff patch
         3) After wake all reader patch
         4) After reader spin on writer patch
         5) After writer spin on reader patch

  Tests             K1      K2      K3      K4      K5
  -----             --      --      --      --      --
  bad-altstack   39928   35807   36422   40062   40747
  stackmmap        187     365     435     255     198
  vm            309589  296097  262045  281974  310439
  vm-segv       113776  114058  112318  115422  110550
Here, bad-altstack dropped about 10% after the lock handoff patch. However,
the performance was recovered by the later patches. The stackmmap results
don't look quite right, as the numbers are much smaller than those in
the report.

I will rerun the tests again when I acquire a high core count system.

Anyway, the lock handoff patch is expected to reduce throughput under
heavy contention.
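To illustrate why: once a waiter has waited too long, the handoff
mechanism sets a handoff bit in the lock word that blocks further
optimistic lock stealing until that waiter gets the lock, so the rwsem
degrades toward FIFO ordering. Below is a much simplified user-space
sketch of that idea; the bit names and layout are made up for the
illustration and this is not the actual rwsem code:

/*
 * Simplified illustration of the handoff idea, NOT the kernel rwsem code.
 * The bit layout and names below are invented for this sketch.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WRITER_LOCKED   0x1UL   /* lock currently held by a writer       */
#define HANDOFF         0x2UL   /* a starved waiter has claimed the lock */

static atomic_ulong lock_word;

/*
 * Opportunistic writer trylock, as a spinning lock stealer would use it.
 * Without the handoff bit, spinners can keep stealing the lock ahead of
 * sleeping waiters, which maximizes throughput but can starve them.
 * Once HANDOFF is set, the steal is refused and the spinner has to queue,
 * which is where the throughput loss under heavy contention comes from.
 */
static bool opt_write_trylock(void)
{
        unsigned long old = atomic_load(&lock_word);

        while (!(old & (WRITER_LOCKED | HANDOFF))) {
                if (atomic_compare_exchange_weak(&lock_word, &old,
                                                 old | WRITER_LOCKED))
                        return true;    /* stole the lock */
        }
        return false;                   /* held, or a handoff is pending */
}

int main(void)
{
        printf("steal with lock free:   %d\n", opt_write_trylock());  /* 1 */

        /* a waiter that waited too long sets the handoff bit ...         */
        atomic_fetch_or(&lock_word, HANDOFF);
        atomic_fetch_and(&lock_word, ~WRITER_LOCKED);  /* owner unlocks   */

        /* ... and now the steal fails even though the lock is free       */
        printf("steal with handoff set: %d\n", opt_write_trylock());  /* 0 */
        return 0;
}

So the handoff trades some throughput for fairness, which is why a drop
in a heavily contended microbenchmark is not surprising by itself.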

>> [locking/rwsem] adc32e8877: will-it-scale.per_thread_ops -21.0% regression
> Will look into that also.

I can reproduce the regression on the same Skylake system.

The results of the page_fault1 will-it-scale test are as follows:

 Threads       K2        K3        K4        K5
 -------       --        --        --        --
      20   5549772   5550332   5463961   5400064
      40   9540445  10286071   9705062   7706082
      60   8187245   8212307   7777247   6647705
      89   8390758   9619271   9019454   7124407

So the wake-all-reader patch is good for this benchmark. The performance
was reduced a bit with the reader-spin-on-writer patch. It got even worse
with the writer-spin-on-reader patch.

Looking at the perf output, rwsem contention accounted for less than
1% of the total CPU cycles, so I believe the regression was caused by
the behavior change introduced by the two reader optimistic spinning
patches. These patches make writers less preferred than before, and I
think the performance of this microbenchmark may be more dependent on
writer performance.
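For context, page_fault1 has each thread repeatedly mmap an anonymous
region, touch every page (each fault takes mmap_sem for read), and then
munmap it (which takes mmap_sem for write). The loop below is a rough
sketch of that pattern, not the actual will-it-scale source, and the
size and iteration count are made up; the point is that if the
writer-side munmap is delayed by reader spinning, every iteration of
the loop stalls:

/* Rough sketch of the per-thread page_fault1 loop (illustrative only). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        long iterations = 0;

        for (int i = 0; i < 10; i++) {  /* the real test loops until stopped */
                char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (c == MAP_FAILED)
                        return 1;

                /* first touch of each page faults it in: mmap_sem read-locked */
                for (unsigned long off = 0; off < MEMSIZE; off += pagesize)
                        c[off] = 0;

                /* munmap needs mmap_sem for write: writer progress gates the loop */
                munmap(c, MEMSIZE);
                iterations++;
        }
        printf("%ld iterations\n", iterations);
        return 0;
}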

Looking at the lock event counts for K5:

 rwsem_opt_fail=253647
 rwsem_opt_nospin=8776
 rwsem_opt_rlock=259941
 rwsem_opt_wlock=2543
 rwsem_rlock=237747
 rwsem_rlock_fail=0
 rwsem_rlock_fast=0
 rwsem_rlock_handoff=0
 rwsem_sleep_reader=237747
 rwsem_sleep_writer=23098
 rwsem_wake_reader=6033
 rwsem_wake_writer=47032
 rwsem_wlock=15890
 rwsem_wlock_fail=10
 rwsem_wlock_handoff=3991

For K4, it was

 rwsem_opt_fail=479626
 rwsem_opt_rlock=8877
 rwsem_opt_wlock=114
 rwsem_rlock=453874
 rwsem_rlock_fail=0
 rwsem_rlock_fast=1234
 rwsem_rlock_handoff=0
 rwsem_sleep_reader=453058
 rwsem_sleep_writer=25836
 rwsem_wake_reader=11054
 rwsem_wake_writer=71568
 rwsem_wlock=24515
 rwsem_wlock_fail=3
 rwsem_wlock_handoff=5245

It can be seen that a lot more readers got the lock via optimistic
spinning in K5 (rwsem_opt_rlock went from 8877 to 259941, while the
slowpath rwsem_rlock dropped from 453874 to 237747). One possibility is
that reader optimistic spinning causes readers to spread out into more
lock acquisition groups than before. The K3 results show that grouping
more readers into one lock acquisition group helps to improve
performance for this microbenchmark. I will need to run more tests to
find out the root cause of this regression. It is not an easy problem
to solve.

In the meantime, I am going to send out an updated patchset tomorrow so
that Peter can review the patches again when he is available.

Cheers,
Longman