Re: [RFC][PATCH 0/3] locking/mutex: Rewrite basic mutex

From: Waiman Long
Date: Tue Aug 23 2016 - 19:10:11 EST


On 08/23/2016 04:41 PM, Peter Zijlstra wrote:
> On Tue, Aug 23, 2016 at 03:36:17PM -0400, Waiman Long wrote:
>> I think this is the right way to go. There isn't any big change in the
>> slowpath, so the contended performance should be the same. The fastpath,
>> however, will get a bit slower, as a single atomic op plus a jump instruction
>> (a single cacheline load) is replaced by a read-and-test and cmpxchg
>> (potentially 2 cacheline loads), which will be somewhat slower than the
>> optimized assembly code.
> Yeah, I'll try and run some workloads tomorrow if you and Jason don't
> beat me to it ;-)
>
>> Alternatively, you can replace the
>> __mutex_trylock() in mutex_lock() by just a blind cmpxchg to optimize the
>> fastpath further.
> Problem with that is that we need to preserve the flag bits, so we need
> the initial load.
>
> Or were you thinking of: cmpxchg(&lock->owner, 0UL, (unsigned
> long)current), which only works on uncontended locks?

Yes, that is what I was thinking about. It was a lesson learned in my qspinlock patch. I used to do a TATAS in the locking fastpath. Then I was told that we should optimize for the uncontended case, so I changed the fastpath to just TAS. I am not sure if the same rule should apply to mutexes or not.
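
To make the comparison concrete, here is a minimal userspace sketch of the two fastpath shapes being discussed, assuming an owner word that packs the task pointer together with a few low flag bits. The struct, flag mask, and function names below are made up for illustration; this is not the code from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_mutex {
        /* owner task pointer in the upper bits, flag bits in the low bits */
        _Atomic uintptr_t owner;
};

#define SKETCH_FLAG_MASK        0x07UL

/* TATAS-style: load first so the flag bits can be preserved in the new value */
static bool sketch_trylock_tatas(struct sketch_mutex *m, uintptr_t curr)
{
        uintptr_t old = atomic_load_explicit(&m->owner, memory_order_relaxed);

        if (old & ~SKETCH_FLAG_MASK)
                return false;   /* already owned */

        return atomic_compare_exchange_strong(&m->owner, &old,
                                              curr | (old & SKETCH_FLAG_MASK));
}

/* Blind cmpxchg: only succeeds when the whole word is clear (uncontended, no flags) */
static bool sketch_trylock_blind(struct sketch_mutex *m, uintptr_t curr)
{
        uintptr_t expected = 0;

        return atomic_compare_exchange_strong(&m->owner, &expected, curr);
}

The blind-cmpxchg form saves the initial load on the uncontended path, but it fails whenever any flag bit is set, which is exactly the trade-off being weighed above.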

>> A cmpxchg will still be a tiny bit slower than other
>> atomic ops, but it will be more acceptable, I think.
> I don't think cmpxchg is much slower than, say, xadd or xchg; the typical
> problem with cmpxchg is the looping part, but single-instruction costs
> should be similar.

My testing in the past showed that cmpxchg was a tiny bit slower than xchg or atomic_inc, for example. In this context, the performance difference, if any, should not be noticeable.
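
As a rough illustration, a single-threaded loop of the following sort can show that small gap. This is purely a sketch (not a benchmark posted in this thread), built with something like gcc -O2 -std=c11:

#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS   100000000UL

static _Atomic unsigned long word;

static double elapsed(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) * 1e-9;
}

int main(void)
{
        struct timespec t0, t1, t2;
        unsigned long expected;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
                atomic_exchange(&word, i);              /* xchg */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        atomic_store(&word, 0);
        for (unsigned long i = 0; i < ITERS; i++) {
                expected = i;                           /* always succeeds: 0 -> 1 -> 2 ... */
                atomic_compare_exchange_strong(&word, &expected, i + 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("xchg:    %.3f ns/op\n", elapsed(&t0, &t1) * 1e9 / ITERS);
        printf("cmpxchg: %.3f ns/op\n", elapsed(&t1, &t2) * 1e9 / ITERS);
        return 0;
}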

Cheers,
Longman