Re: [External] : Re: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Alex Kogan
Date: Tue Apr 13 2021 - 22:30:13 EST


Peter, thanks for all the comments and suggestions!

> On Apr 13, 2021, at 7:30 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Apr 01, 2021 at 11:31:53AM -0400, Alex Kogan wrote:
>
>> +/*
>> + * cna_splice_tail -- splice the next node from the primary queue onto
>> + * the secondary queue.
>> + */
>> +static void cna_splice_next(struct mcs_spinlock *node,
>> +			    struct mcs_spinlock *next,
>> +			    struct mcs_spinlock *nnext)
>
> You forgot to update the comment when you changed the name on this
> thing.
Good catch, thanks.

>
>> +/*
>> + * cna_order_queue - check whether the next waiter in the main queue is on
>> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
>> + * it in the main queue, move the former onto the secondary queue.
>> + */
>> +static void cna_order_queue(struct mcs_spinlock *node)
>> +{
>> +	struct mcs_spinlock *next = READ_ONCE(node->next);
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	int numa_node, next_numa_node;
>> +
>> +	if (!next) {
>> +		cn->partial_order = LOCAL_WAITER_NOT_FOUND;
>> +		return;
>> +	}
>> +
>> +	numa_node = cn->numa_node;
>> +	next_numa_node = ((struct cna_node *)next)->numa_node;
>> +
>> +	if (next_numa_node != numa_node) {
>> +		struct mcs_spinlock *nnext = READ_ONCE(next->next);
>> +
>> +		if (nnext) {
>> +			cna_splice_next(node, next, nnext);
>> +			next = nnext;
>> +		}
>> +		/*
>> +		 * Inherit NUMA node id of primary queue, to maintain the
>> +		 * preference even if the next waiter is on a different node.
>> +		 */
>> +		((struct cna_node *)next)->numa_node = numa_node;
>> +	}
>> +}
>
> So the obvious change since last time I looked at this is that it now
> only looks 1 entry ahead. Which makes sense I suppose.
This is in response to the critique that the worst-case time complexity of
cna_order_queue() was O(n). With this change, the complexity is constant.

>
> I'm not really a fan of the 'partial_order' name combined with that
> silly enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND }. That's just
> really bad naming all around. The enum is about having a waiter while
> the variable is about partial order, that doesn't match at all.
Fair enough.

> If you rename the variable to 'has_waiter' and simply use 0,1 values,
> things would be ever so more readable. But I don't think that makes
> sense, see below.
>
> I'm also not sure about that whole numa_node thing, why would you
> over-write the numa node, why at this point ?
With this move-one-by-one approach, I want to keep the NUMA-node
preference of the lock holder even if the next-next waiter is on a different
NUMA node. Otherwise, we would end up switching the preference often and
the entire scheme would not perform well. In particular, we might easily
end up with threads from the preferred node waiting in the secondary queue.
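
To make this more concrete, here is a made-up example (NUMA-node ids in
parentheses). Suppose the waiter at the head of the main queue, H, is on
node 0:

	H(0) -> A(1) -> B(1) -> C(0) -> ...

When H runs cna_order_queue(), A is remote and has a successor, so A is
spliced onto the secondary queue and B inherits numa_node 0. When B later
reaches the head of the queue and runs cna_order_queue() itself, it keeps
ordering with the node-0 preference, so C(0) stays in the main queue.
Without the overwrite, B would start preferring node 1, and C(0) could well
end up behind A on the secondary queue, i.e., a thread from the preferred
node would be penalized.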

>
>> +
>> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
>> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> +						 struct mcs_spinlock *node)
>> +{
>> +	/*
>> +	 * Try and put the time otherwise spent spin waiting on
>> +	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
>> +	 */
>> +	cna_order_queue(node);
>> +
>> +	return 0; /* we lied; we didn't wait, go do so now */
>
> So here we inspect one entry ahead and then quit. I can't remember, but
> did we try something like:
>
> 	/*
> 	 * Try and put the time otherwise spent spin waiting on
> 	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> 	 * Move one entry at a go until either the list is fully
> 	 * sorted or we ran out of spin condition.
> 	 */
> 	while (READ_ONCE(lock->val) & _Q_LOCKED_PENDING_MASK &&
> 	       node->partial_order)
> 		cna_order_queue(node);
>
> 	return 0;
>
> This will keep moving @next to the remote list until such a time that
> we're forced to continue or @next is local.
We have not tried that. This is actually an interesting idea, with its pros and cons.
That is, we are likely to filter “non-preferred” waiters into the secondary queue
faster, but we are also more likely to run into a situation where the lock becomes
available while we are still running the cna_order_queue() logic, thus prolonging
the handover time.

With this change, however, there is no need to call cna_order_queue() again
in cna_lock_handoff(), which I guess is another advantage.

All in all, I am fine switching to this alternative if you like it more.

BTW, we can return node->partial_order instead of 0, to skip the call to
atomic_cond_read_acquire() in the generic code.

>
>> +}
>> +
>> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
>> +				    struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	u32 val = 1;
>> +
>> +	u32 partial_order = cn->partial_order;
>> +
>> +	if (partial_order == LOCAL_WAITER_NOT_FOUND)
>> +		cna_order_queue(node);
>> +
>
> AFAICT this is where playing silly games with ->numa_node belong; but
> right along with that goes a comment that describes why any of that
> makes sense.
>
> I mean, if you leave your node, for any reason, why bother coming back
> to it, why not accept it is a sign of the gods you're overdue for a
> node-change?
I think there is some misunderstanding here — let me try to clarify.
In the current logic, we first call cna_order_queue() from cna_wait_head_or_lock().
If we find an appropriate “local” waiter (either real or fake), we set it as
next and make sure the numa_node of that node is the same as ours.
If not (i.e., when we do not have any next waiter), we set partial_order to
LOCAL_WAITER_NOT_FOUND (pardon the names) and go spin on the lock (calling
atomic_cond_read_acquire() in the generic code).
During this spin time, new waiters can join the queue.
Hence, we recheck the state of the queue by calling cna_order_queue() again
from cna_lock_handoff().
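
Schematically, the current flow for the waiter at the head of the main queue is:

	cna_wait_head_or_lock()
		cna_order_queue()		<- first look at the next waiter
		return 0			<- go spin in the generic code
	atomic_cond_read_acquire()		<- new waiters may join here
	cna_lock_handoff()
		if (partial_order == LOCAL_WAITER_NOT_FOUND)
			cna_order_queue()	<- recheck after the spin
		pass the lock to the (possibly updated) next waiter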

As just mentioned, if we change the logic as you suggest, there is
really no reason to call cna_order_queue() again.

>
> Was the efficacy of this scheme tested?
>
>> +	/*
>> +	 * We have a local waiter, either real or fake one;
>> +	 * reload @next in case it was changed by cna_order_queue().
>> +	 */
>> +	next = node->next;
>> +	if (node->locked > 1)
>> +		val = node->locked; /* preseve secondary queue */
>
> IIRC we used to do:
>
> val |= node->locked;
>
> Which is simpler for not having branches. Why change a good thing?
With |= we might set val to the encoded tail + 1 (rather than the encoded tail).
This would still work, as decode_tail(val) does not care about the LSB, but it seems fragile to me.
We talked about that before:
https://lore.kernel.org/linux-arm-kernel/E32F90E2-F00B-423B-A851-336004EF6593@xxxxxxxxxx
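
To spell out the concern with a made-up value: suppose node->locked holds the
encoded tail 0x00040000 of the secondary queue (an encoded tail always has its
low bit clear, since the tail lives in the upper bits of the lock word). Then:

	val = 1;
	val |= node->locked;		/* val == 0x00040001, i.e. tail + 1 */

vs.

	val = 1;
	if (node->locked > 1)
		val = node->locked;	/* val == 0x00040000, the exact tail */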

If you think this is a non-issue, I will be happy to adopt your suggestion.

Regards,
— Alex