Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

From: Waiman Long
Date: Thu Aug 01 2013 - 17:09:41 EST


On 08/01/2013 04:23 PM, Raghavendra K T wrote:
On 08/01/2013 08:07 AM, Waiman Long wrote:

+}
+
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	if (likely(queue_spin_trylock(lock)))
+		return;
+	queue_spin_lock_slowpath(lock);
+}

Quickly falling into the slowpath may hurt performance in some cases, no?

Failing the trylock means that the process is likely to wait. I do retry one more time in the slowpath before waiting in the queue.
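
In outline, that retry looks something like the sketch below; this is not
the actual patch code, and queue_and_wait() is an illustrative placeholder
for the queuing logic:

static void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	/*
	 * Retry the lock once more; the trylock that failed in the
	 * fast path may have raced with a concurrent release.
	 */
	if (queue_spin_trylock(lock))
		return;

	/* Otherwise join the wait queue and spin for our turn. */
	queue_and_wait(lock);	/* placeholder for the queuing code */
}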

Instead, I tried something like this:

#define SPIN_THRESHOLD 64

static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
	unsigned count = SPIN_THRESHOLD;

	do {
		if (likely(queue_spin_trylock(lock)))
			return;
		cpu_relax();
	} while (count--);
	queue_spin_lock_slowpath(lock);
}

Though I could see some gains in overcommit, it hurt undercommit
in some workloads :(.

The gcc 4.4.7 compiler that I used on my test machine has a tendency to allocate stack space for variables instead of using registers when a loop is present, so I try to avoid having a loop in the fast path. Also, the count itself is rather arbitrary. For the first pass, I would like to keep things simple; we can always enhance it once it is accepted and merged.
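
One way to keep the inlined fast path loop-free while still getting a
bounded spin (a sketch of the idea under discussion, not code from either
patch) is to move the SPIN_THRESHOLD loop to the top of the out-of-line
slowpath:

#define SPIN_THRESHOLD	64

static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
	/*
	 * Loop-free fast path: any loop-related register spills are
	 * confined to the non-inlined slowpath.
	 */
	if (likely(queue_spin_trylock(lock)))
		return;
	queue_spin_lock_slowpath(lock);	/* not inlined */
}

void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	unsigned int count = SPIN_THRESHOLD;

	/* Bounded spin before committing to the wait queue. */
	do {
		if (queue_spin_trylock(lock))
			return;
		cpu_relax();
	} while (count--);

	/* ... fall through to the queuing code ... */
}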



+/**
+ * queue_trylock - try to acquire the lock bit ignoring the qcode in lock
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_trylock(struct qspinlock *lock)
+{
+	if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0))
+		return 1;
+	return 0;
+}

It took me a long time to confirm to myself that this is used when we
exhaust all the nodes. I am not sure of a better name that would avoid
confusion with queue_spin_trylock; anyway, they are in different files :).


Yes, I know it is confusing. I will change the name to make it more explicit.
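
For context, the fallback being described would look roughly like this
(an illustrative sketch, not the patch code; the function name here is
hypothetical): once all per-CPU queue nodes are in use, the CPU cannot
join the queue and has to fall back on the lock bit alone:

static void queue_spin_lock_nodes_exhausted(struct qspinlock *lock)
{
	/*
	 * No queue node available: spin directly on the lock bit
	 * via the unconditional trylock, ignoring the qcode.
	 */
	while (!queue_trylock(lock))
		cpu_relax();
}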


Results:
Sandy Bridge, 32 CPUs / 16 cores (HT on), 2-node machine, with 16-vCPU KVM
guests.

In general, I am seeing undercommit loads benefiting from the patches.

base = 3.11-rc1
patched = base + qlock
hackbench (time in sec, lower is better)
+------+------------+----------+------------+----------+--------------+
|  oc  |       base |    stdev |    patched |    stdev | %improvement |
+------+------------+----------+------------+----------+--------------+
| 0.5x |    18.9326 |   1.6072 |    20.0686 |   2.9968 |     -6.00023 |
| 1.0x |    34.0585 |   5.5120 |    33.2230 |   1.6119 |      2.45313 |
+------+------------+----------+------------+----------+--------------+

ebizzy (records/sec, higher is better)
+------+------------+----------+------------+----------+--------------+
|  oc  |       base |    stdev |    patched |    stdev | %improvement |
+------+------------+----------+------------+----------+--------------+
| 0.5x | 20499.3750 | 466.7756 | 22257.8750 | 884.8308 |      8.57831 |
| 1.0x | 15903.5000 | 271.7126 | 17993.5000 | 682.5095 |     13.14176 |
| 1.5x |  1883.2222 | 166.3714 |  1742.8889 | 135.2271 |     -7.45177 |
| 2.5x |   829.1250 |  44.3957 |   803.6250 |  78.8034 |     -3.07553 |
+------+------------+----------+------------+----------+--------------+

dbench (throughput in MB/sec, higher is better)
+------+------------+----------+------------+----------+--------------+
|  oc  |       base |    stdev |    patched |    stdev | %improvement |
+------+------------+----------+------------+----------+--------------+
| 0.5x | 11623.5000 |  34.2764 | 11667.0250 |  47.1122 |      0.37446 |
| 1.0x |  6945.3675 |  79.0642 |  6798.4950 | 161.9431 |     -2.11468 |
| 1.5x |  3950.4367 |  27.3828 |  3910.3122 |  45.4275 |     -1.01570 |
| 2.0x |  2588.2063 |  35.2058 |  2520.3412 |  51.7138 |     -2.62209 |
+------+------------+----------+------------+----------+--------------+

I saw the dbench %improvement figures move to 0.3529, -2.9459, 3.2423
and 4.8027 respectively after delaying entry into the slowpath as above.
[...]

I have not yet tested on a bigger machine. I hope that a bigger machine
will see significant undercommit improvements.


Thanks for running the tests. I am a bit confused about the terminology, though. What exactly do undercommit and overcommit mean?

Regards,
Longman
