Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Waiman Long
Date: Tue Sep 01 2020 - 13:38:46 EST


On 8/31/20 5:39 PM, Alex Kogan wrote:
On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@xxxxxxxxxx> wrote:

On 4/3/20 4:59 PM, Alex Kogan wrote:
In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
primary queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the primary queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the primary queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the primary queue.
For more details, see https://urldefense.com/v3/__https://arxiv.org/abs/1810.05600__;!!GqivPVa7Brio!OaieLQ3MMZThgxr-Of8i9dbN5CwG8mXSIBJ_sUofhAXcs43IWL2x-stO-XKLEebn$ .

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
boot time only if we run on a multi-node machine in native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@xxxxxxxxxx>
Reviewed-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
---
There is also a concern that the worst case latency for a lock transfer can be close to O(n) which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.
I understand the concern. While your patch addresses it, I am afraid it makes
the code somewhat more complex, and duplicates some of the slow path
functionality (e.g., the spin loop until the lock value changes to a certain
value).

Let me suggest a different idea for gradually restructuring the main queue
that has some similarity to the way you suggested to handle prioritized waiters.
Basically, instead of scanning the entire chain of main queue waiters,
we can check only the next waiter and, if present and it runs on a different
node, move it to the secondary queue. In addition, to maintain the preference
for a certain numa node ID, we set the numa node of the next-next waiter,
if present, to that of the current lock holder. This is the part similar to the
way you suggested to handle prioritized waiters.

This way, the worst case complexity of cna_order_queue() decreases from O(n)
down to O(1), as we always “scan" only one waiter. And as before, we change
the preference (and flush the secondary queue) after X handovers (or after
Y ms, as in your in other patch).

I attach the patch that applies on top of your patch for prioritized nodes
(0006), but does not include your patch 0007 (time based threshold),
which I will integrate into the series in the next revision.

Please, let me know what you think.

That is an interesting idea. I don't have any fundamental objection to that. I just wonder how it will impact the kind of performance test that you ran before. It would be nice to see the performance impact with that change.

Cheers,
Longman