Re: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

From: Ingo Molnar
Date: Fri Jun 17 2011 - 15:08:34 EST

* Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:

> > On 2.6.39, the contention of anon_vma->lock occupies 3.25% of cpu.
> > However, after the switch of the lock to mutex on 3.0-rc2, the mutex
> > acquisition jumps to 18.6% of cpu. This seems to be the main cause of
> > the 52% throughput regression.
> >
> This patch makes the mutex in Tim's workload take a bit less CPU time
> (4% down) but it doesn't really fix the regression. When spinning for a
> value it's always better to read it first before attempting to write it.
> This saves expensive operations on the interconnect.
> So it's not really a fix for this, but may be a slight improvement for
> other workloads.
> -Andi
> >From 34d4c1e579b3dfbc9a01967185835f5829bd52f0 Mon Sep 17 00:00:00 2001
> From: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> Date: Tue, 14 Jun 2011 16:27:54 -0700
> Subject: [PATCH] mutex: while spinning read count before attempting cmpxchg
> Under heavy contention it's better to read first before trying to
> do an atomic operation on the interconnect.
> This gives a few percent improvement for the mutex CPU time under
> heavy contention and likely saves some power too.
> Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> ---
> kernel/mutex.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index d607ed5..1abffa9 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -170,7 +170,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> if (owner && !mutex_spin_on_owner(lock, owner))
> break;
> - if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> + if (atomic_read(&lock->count) == 1 &&
> + atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> lock_acquired(&lock->dep_map, ip);
> mutex_set_owner(lock);
> preempt_enable();

What is *sorely* missing from your changelog (again) is the
explanation of *why* it improves performance in the contended case:
because the cacheline is brought into shared-read MESI state, which
the CMPXCHG might not dirty if the CMPXCHG fails in the contended
case. This has two effects:

Firstly, this reduces the cost of hard bounces somewhat, because the
owner CPU still has the cacheline in read-shared state, which it can
invalidate from the other CPU's cache on unlock somewhat more cheaply
than if it were forced to bounce the cacheline back.

Secondly, in the contended case more than 2 CPUs are looping on the
same cacheline, and bringing it in shared and doing the cmpxchg loop
will not bounce the cacheline around (if CMPXCHG is smart enough not
to dirty the cacheline even in the failed case - this is CPU model
dependent) - it will spread to all interested CPUs in read-shared
state. This is most likely the dominant factor in Tim's tests.

Had you considered and described all that then you'd inevitably have
been led to consider the much more important issue that is missing
from your patch: the analysis of what happens to the cacheline under
*light* contention, which is by far the most important case ...

In the lightly contended case it's ultimately bad to bring in the
cacheline as read-shared first, because the cmpxchg will have to go
out to the MESI interconnect *again*: this time to flush the
cacheline from the previous owner CPU's cache.

So i don't think we want your patch, without some really good
supporting facts and analysis that show that the lightly contended
case does not suffer.

