Re: [tip:core/locking] x86/smp: Move waiting on contended ticketlock out of line

From: Rik van Riel
Date: Wed Feb 27 2013 - 11:42:45 EST

On 02/13/2013 08:21 PM, Linus Torvalds wrote:
On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:

I have an example of the second case. It is a test case
from a customer issue, where an application is contending on
semaphores, doing semaphore lock and unlock operations. The
test case simply has N threads, trying to lock and unlock the
same semaphore.
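
For readers who want to picture the test, here is a minimal sketch of
that kind of workload. It assumes POSIX unnamed semaphores and pthreads;
the message does not say which semaphore API the customer test used, so
this will not necessarily hit the same kernel spinlock. The names
sem_worker and run_sem_test are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define MAX_THREADS 64

struct sem_test {
    sem_t sem;        /* the single contended semaphore */
    long iterations;  /* lock/unlock pairs per thread */
    long counter;     /* protected by sem; lets us check correctness */
};

/* Each worker repeatedly locks and unlocks the same semaphore. */
static void *sem_worker(void *arg)
{
    struct sem_test *t = arg;

    for (long i = 0; i < t->iterations; i++) {
        while (sem_wait(&t->sem) != 0)
            ;                 /* retry if interrupted by a signal */
        t->counter++;
        sem_post(&t->sem);    /* "unlock" */
    }
    return NULL;
}

/* Run nthreads workers; returns the final counter, or -1 on bad input. */
static long run_sem_test(int nthreads, long iterations)
{
    pthread_t tid[MAX_THREADS];
    struct sem_test t = { .iterations = iterations, .counter = 0 };

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    if (sem_init(&t.sem, 0, 1) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, sem_worker, &t);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    sem_destroy(&t.sem);
    return t.counter;
}
```

Timing run_sem_test() for increasing thread counts would give a curve
of the same general shape as the one in the graph, under the
assumptions above.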

The attached graph (which I sent out with the intro email to
my patches) shows how reducing the memory accesses from the
spinlock wait path prevents the large performance degradation
seen with the vanilla kernel. This is on a 24-CPU system with
four 6-core AMD CPUs.

The "prop-N" series are with a fixed delay proportional back-off.
You can see that a small value of N does not help much for large
numbers of cpus, and a large value hurts with a small number of
CPUs. The automatic tuning appears to be quite robust.
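
To make the "prop-N" idea concrete, here is a rough user-space sketch of
a ticket lock with fixed proportional back-off, written with C11
atomics. It is not the kernel code from the patches: struct ticket_lock,
delay_loops and delay_per_waiter are hypothetical names, and
delay_per_waiter is assumed to play the role of N.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space ticket lock; not the kernel's arch_spinlock_t. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently being served */
};

/* Number of waiters ahead of a ticket (safe across unsigned wrap-around). */
static unsigned int tickets_ahead(unsigned int ticket, unsigned int owner)
{
    return ticket - owner;
}

/* Busy-wait for roughly `loops` iterations without touching the lock. */
static void delay_loops(unsigned int loops)
{
    for (volatile unsigned int i = 0; i < loops; i++)
        ;
}

/*
 * Take the lock. Between reads of the lock word, wait for
 * delay_per_waiter * (waiters ahead) iterations, so that waiters far
 * back in the queue poll the shared cache line less often.
 */
static void ticket_lock_prop(struct ticket_lock *l,
                             unsigned int delay_per_waiter)
{
    unsigned int ticket =
        atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    for (;;) {
        unsigned int owner =
            atomic_load_explicit(&l->owner, memory_order_acquire);
        if (owner == ticket)
            return;
        delay_loops(delay_per_waiter * tickets_ahead(ticket, owner));
    }
}

static void ticket_unlock(struct ticket_lock *l)
{
    /* Debug check: the lock must currently be held. */
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) !=
           atomic_load_explicit(&l->next, memory_order_relaxed));
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

With a fixed delay_per_waiter the trade-off described above falls out
directly: too small and many waiters still hammer the lock word, too
large and a lone waiter sleeps past the moment the lock is released.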

Ok, good, so there are some numbers. I didn't see any in the commit
messages anywhere, and since the threads I've looked at are from
tip-bot, I never saw the intro email.

Some people at HP have collected an extensive list of AIM7 results,
covering all the different AIM7 workloads, on an 80-core DL-980, with HT

The AIM7 workloads all work by slowly increasing the number of
worker processes, all of which have some duty cycle (busy & sleep).
Adding more processes tends to increase the number of jobs/minute
completed, up to a certain point. For some workloads, the system
has a performance peak and performance levels off at or near that
peak. For other workloads, performance drops when more processes
are added beyond the peak, and settles at a lower plateau.

To keep the results readable and relevant, I am reporting the
plateau performance numbers. Comments are given where required.

Workload      3.7.6 vanilla   3.7.6 w/ backoff  Comments

all_utime     333000          333000
alltests      300000-470000   180000-440000     large variability
compute       528000          528000
custom        290000-320000   250000-330000     4 fast runs, 1 slow
dbase         920000          925000
disk          100000          90000-120000      similar plateau, wild
                                                swings with patches
five_sec      140000          140000
fserver       160000-300000   250000-430000     w/ patch drops off at
                                                higher number of users
high_systime  80000-110000    30000-125000      w/ patch mostly 40k-70k,
                                                wild swings
long          no performance plateau, equal performance for both
new_dbase     960000          96000
new_fserver   150000-300000   210000-420000     vanilla drops off,
                                                w/ patches wild swings
shared        270000-440000   120000-440000     all runs ~equal to
                                                vanilla up to 1000
                                                users, one out of 5
                                                runs slows down past
                                                1100 users
short         120000          190000

In conclusion, the spinlock backoff patches seem to significantly
improve performance in workloads where there is simple contention
on just one or two spinlocks. However, in more complex workloads,
high variability is seen, including performance regression in some
test runs.

One hypothesis is that before the spinlock backoff patches, the
workloads get contention (and bottleneck) on multiple locks. With
the patches, the contention on some of the locks is relieved, and
more tasks bunch up on the remaining bottlenecks, leading to worse

That said, it's interesting that this happens with the semaphore path.
We've had other cases where the spinlock in the *sleeping* locks has
caused problems, and I wonder if we should look at that path in

If we want to get reliable improved performance without unpredictable
performance swings, we should probably change some of the kernel's
spinlocks, especially the ones embedded in sleeping locks, into
scalable locks like Michel's implementation of MCS locks.
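
For reference, a minimal sketch of what an MCS-style queue lock looks
like, using C11 atomics in user space. This is the textbook algorithm,
not Michel's kernel implementation; the struct and function names are
hypothetical. The point is that each waiter spins on its own node
rather than on a shared lock word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the wait queue */
    atomic_bool locked;               /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;  /* last waiter in the queue, or NULL */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Append ourselves; the previous tail (if any) is our predecessor. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;  /* queue was empty: we own the lock */

    atomic_store_explicit(&prev->next, me, memory_order_release);

    /* Spin on our own node only, not on the shared lock word. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
    /* Debug check: a held lock always has a non-empty queue. */
    assert(atomic_load_explicit(&lock->tail, memory_order_relaxed) != NULL);

    struct mcs_node *succ =
        atomic_load_explicit(&me->next, memory_order_acquire);

    if (succ == NULL) {
        struct mcs_node *expected = me;

        /* No visible successor: try to reset the queue to empty. */
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;

        /* A successor is mid-enqueue; wait for it to link itself. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }

    /* Hand the lock directly to the next waiter. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each caller supplies its own struct mcs_node (typically on the stack)
for the duration of the critical section, which is the main API
difference from a ticket lock.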

We may be hitting the limit of what can be done with the current
ticket lock data structure. It simply may not scale as far as the
hardware on which Linux is being run.

All rights reversed