On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:Kernel JPM %Change from (1)But wouldn't kernel (4) be the one that was the most highly constrained?
------ --- ----------------
1 148265 -
2 238715 +61%
3 242048 +63%
4 234881 +58%
The use of unfair qrwlock provides a small boost of 2%, while using
fair qrwlock leads to 3% decrease of performance. However, looking
at the perf profiles, we can clearly see that other bottlenecks were
constraining the performance improvement.
Perf profile of kernel (2):
18.20% reaim [kernel.kallsyms] [k] __write_lock_failed
9.36% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.91% reaim [kernel.kallsyms] [k] mspin_lock
2.73% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.23% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.29% reaim [kernel.kallsyms] [k] __read_lock_failed
1.21% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.14% reaim [kernel.kallsyms] [k] zap_pte_range
1.13% reaim [kernel.kallsyms] [k] _raw_spin_lock
1.04% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
Perf profile of kernel (3):
10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
7.98% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
5.83% reaim [kernel.kallsyms] [k] mspin_lock
2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.71% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.51% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
1.35% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
1.12% reaim [kernel.kallsyms] [k] zap_pte_range
1.06% reaim [kernel.kallsyms] [k] perf_event_aux_ctx
1.01% reaim [kernel.kallsyms] [k] perf_event_aux
(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
is unrelated to the qrwlock.)
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ union qrwcnts rwcnts;
+
+ rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
+ return !rwcnts.writer&& !rwcnts.readers;
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+ union qrwcnts cnts;
+ u8 wmask;
+
+ cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+ wmask = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
+ if (likely(!(cnts.writer& wmask))) {
+ cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
On an unfair lock, this can momentarily make queue_read_can_lock() give
a false positive. Not sure that this is a problem -- after all, the
return value from queue_read_can_lock() is immediately obsolete anyway.
+/**How about the new smp_store_release() for this write? Looks to me that
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * Make sure that none of the critical section will be leaked out.
+ */
+ smp_mb__before_clear_bit();
+ ACCESS_ONCE(lock->cnts.writer) = 0;
+ smp_mb__after_clear_bit();
smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
if they in fact do work for all architectures.
+/*Sometimes, anyway! (Referring to your performance results on top of
+ * Compared with regular rwlock, the queue rwlock has has the following
+ * advantages:
+ * 1. It is more deterministic for the fair variant. Even though there is
+ * a slight chance of stealing the lock if come at the right moment, the
+ * granting of the lock is mostly in FIFO order. Even the default unfair
+ * variant is fairer at least among the writers.
+ * 2. It is faster in high contention situation.
Ingo's patch.)
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ *
+ * The use of smp_wmb() is to make sure that the other CPUs see the change
+ * ASAP.
+ */
+static __always_inline void
+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *prev;
+
+ node->next = NULL;
+ node->wait = true;
+ prev = xchg(&lock->waitq, node);
+ if (prev) {
+ prev->next = node;
+ smp_wmb();
This smp_wmb() desperately needs a comment. Presumably it is ordering
the above "prev->next = node" with some later write, but what write?
Oh... I see the header comment above.
Actually, memory barriers don't necessarily make things visible sooner.
They are instead used for ordering. Or did you actually measure a
performance increase with this? (Seems -highly- unlikely given smp_wmb()'s
definition on x86...)
+ /*Because smp_wmb() does not order reads, reads from the critical section
+ * Wait until the waiting flag is off
+ */
+ while (ACCESS_ONCE(node->wait))
+ cpu_relax();
+ }
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to the current head of queue
+ */
+static __always_inline void
+signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *next;
+
+ /*
+ * Try to notify the next node first without disturbing the cacheline
+ * of the lock. If that fails, check to see if it is the last node
+ * and so should clear the wait queue.
+ */
+ next = ACCESS_ONCE(node->next);
+ if (likely(next))
+ goto notify_next;
+
+ /*
+ * Clear the wait queue if it is the last node
+ */
+ if ((ACCESS_ONCE(lock->waitq) == node)&&
+ (cmpxchg(&lock->waitq, node, NULL) == node))
+ return;
+ /*
+ * Wait until the next one in queue set up the next field
+ */
+ while (likely(!(next = ACCESS_ONCE(node->next))))
+ cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+notify_next:
+ barrier();
+ ACCESS_ONCE(next->wait) = false;
+ smp_wmb();
could leak out of the critical section. A full memory barrier (smp_mb())
seems necessary to avoid this.
Yes, you do have full memory barriers implicit in various atomic operations,
but it appears to be possible to avoid them all in some situations.
+/**Suppose that there now is a writer, but no readers...
+ * queue_write_3step_lock - acquire write lock in 3 steps
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 otherwise
+ *
+ * Step 1 - Try to acquire the lock directly if no reader is present
+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
+ * Step 3 - When the readers field goes to 0, set the locked flag
+ *
+ * When not in fair mode, the readers actually ignore the second step.
+ * However, this is still necessary to force other writers to fall in line.
+ */
+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
+{
+ union qrwcnts old, new;
+
+ old.rw = ACCESS_ONCE(lock->cnts.rw);
+
+ /* Step 1 */
+ if (!old.writer& !old.readers) {
+ new.rw = old.rw;
+ new.writer = QW_LOCKED;
+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+ return 1;
+ }
+
+ /* Step 2 */
+ if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
+ return 0;
+
+ /* Step 3 */
+ while (true) {
+ cpu_relax();
+ old.rw = ACCESS_ONCE(lock->cnts.rw);
+ if (!old.readers) {... can't this mistakenly hand out the lock to a second writer?
+ new.rw = old.rw;
+ new.writer = QW_LOCKED;
+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
+ == old.rw))
Ah, the trick is that we are at the head of the queue, so the only writer
we can possibly contend with is a prior holder of the lock. Once that
writer leaves, no other writer but can appear. And the QW_WAITING bit
prevents new writers from immediately grabbing the lock.
+ return 1;If we get here, queue_write_3step_lock() just executed a successful
+ }
+ }
+ /* Should never reach here */
+ return 0;
+}
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+ struct qrwnode node;
+
+ /*
+ * Put the writer into the wait queue
+ */
+ wait_in_queue(lock,&node);
+
+ /*
+ * At the head of the wait queue now, call queue_write_3step_lock()
+ * to acquire the lock until it is done.
+ */
+ while (!queue_write_3step_lock(lock))
+ cpu_relax();
cmpxchg(), which implies a full memory barrier. This prevents the
critical section from leaking out, good!
+ signal_next(lock,&node);--
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
--
1.7.1
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/