Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock

From: Waiman Long
Date: Thu Jan 23 2020 - 14:08:41 EST


This is a multi-part message in MIME format.On 1/23/20 10:25 AM, Waiman Long wrote:
> On 1/23/20 6:35 AM, Will Deacon wrote:
>> Hi folks,
>>
>> (I think Lihao is travelling at the moment, so he may be delayed in his
>> replies)
>>
>> On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
>>> On 1/22/20 6:45 AM, Lihao Liang wrote:
>>>> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
>>>>> Summary
>>>>> -------
>>>>>
>>>>> Lock throughput can be increased by handing a lock to a waiter on the
>>>>> same NUMA node as the lock holder, provided care is taken to avoid
>>>>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>>>>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>>>>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>>>>
>>>> Thanks for your patches. The experimental results look promising!
>>>>
>>>> I understand that the new CNA qspinlock uses randomization to achieve
>>>> long-term fairness, and provides the numa_spinlock_threshold parameter
>>>> for users to tune. As Linux runs extremely diverse workloads, it is not
>>>> clear how randomization affects its fairness, and how users with
>>>> different requirements are supposed to tune this parameter.
>>>>
>>>> To this end, Will and I consider it beneficial to be able to answer the
>>>> following question:
>>>>
>>>> With different values of numa_spinlock_threshold and
>>>> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
>>>> sockets have to wait to acquire the lock? This is particularly relevant
>>>> in high contention situations when new threads keep arriving on the same
>>>> socket as the lock holder.
>>>>
>>>> In this email, I try to provide some formal analysis to address this
>>>> question. Let's assume the probability for the lock to stay on the
>>>> same socket is *at least* p, which corresponds to the probability for
>>>> the function probably(unsigned int num_bits) in the patch to return *false*,
>>>> where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
>>>> function.
>>> That is not strictly true from my understanding of the code. The
>>> probably() function does not come into play if a secondary queue is
>>> present. Also calling cna_scan_main_queue() doesn't guarantee that a
>>> waiter in the same node can be found. So the simple mathematical
>>> analysis isn't that applicable in this case. One will have to do an
>>> actual simulation to find out what the actual behavior will be.
>> It's certainly true that the analysis is based on the worst-case scenario,
>> but I think it's still worth considering. For example, the secondary queue
>> does not exist initially so it seems a bit odd that we only instantiate it
>> with < 1% probability.
>>
>> That said, my real concern with any of this is that it makes formal
>> modelling and analysis of the qspinlock considerably more challenging. I
>> would /really/ like to see an update to the TLA+ model we have of the
>> current implementation [1] and preferably also the userspace version I
>> hacked together [2] so that we can continue to test and validate changes
>> to the code outside of the usual kernel stress-testing.
> I do agree that the current CNA code is hard to model. The CNA lock
> behaves like a regular qspinlock in many cases. If the lock becomes
> fairly contended with waiters from different nodes, it will
> opportunistically switch to CNA mode where preference is given to
> waiters in the same node.

BTW, I added the attached draft lock_event patch on top of the v9 CNA
patch series to observe the behavior of the CNA lock. Using a 2-socket
96-thread x86-64 server, the lock event output after boot up was:

cna_intra_max=1942
cna_mainscan_hit=134
cna_merge_queue=73
cna_prescan_hit=16662
cna_prescan_miss=268
cna_splice_new=352
cna_splice_old=2415
lock_pending=130090
lock_slowpath=191868
lock_use_node2=135

After resetting the counts and running a 96-thread lock stress test for
10s, I got

cna_intra_max=65536
cna_mainscan_hit=46
cna_merge_queue=661
cna_prescan_hit=42486841
cna_prescan_miss=68
cna_splice_new=676
cna_splice_old=402
lock_pending=11012
lock_slowpath=44332335
lock_use_node2=57203

So the cna_intra_max does go to the maximum of 64k.

Cheers,
Longman