Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

From: Linus Torvalds
Date: Wed Feb 13 2013 - 14:36:32 EST

On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> The spinlock backoff code prevents these last cases from
> experiencing large performance regressions when the hardware
> is upgraded.

I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays is often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning.

And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.

> As a car analogy, think of this not as an accelerator, but
> as an airbag. Spinlock backoff (or other scalable locking
> code) exists to keep things from going horribly wrong when
> we hit a scalability wall.
> Does that make more sense?

Not without tons of numbers from many different platforms, it doesn't.
And not without explaining which spinlock it is that is so contended
in the first place.

We've been very good at fixing spinlock contention. Now, that does
mean that what is likely left isn't exactly low-hanging fruit, but it
also means that the circumstances where it triggers are probably quite

So I claim:

- it's *really* hard to trigger in real loads on common hardware.

- if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.

- we absolutely should *not* pessimize the common case for this.

So I suspect contention is something that you *may* need on some
particular platforms ("Oh, I have 64 sockets and 1024 threads, I can
trigger contention easily"), but that tends to be unusual, and any
back-off code should be damn aware of the fact that it only helps the

Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because "back-off" is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off? How do you account for hardware
that notices busy loops and turns them into effectively just mwait?

Btw, the "notice busy loops and turn it into mwait" is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just the value loaded from memory suddenly means that hardware
has a much harder time doing things like that.
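
To make the distinction concrete, here is a minimal user-space sketch of
the two loop shapes. It is not the patch under discussion: the names
(relax_hint, ticket_lock, lock_plain, lock_backoff, delay_per_waiter) are
hypothetical, relax_hint() only stands in for the kernel's cpu_relax(),
and a GCC/Clang compiler with C11 atomics is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for cpu_relax(); PAUSE on x86, nothing elsewhere. */
static inline void relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently being served */
};

/* Plain spin: the only exit condition is the value loaded from memory. */
static void lock_plain(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add(&l->next, 1);

    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        relax_hint();
}

/*
 * Backoff spin: between loads, run a delay loop whose exit condition is
 * a local counter, not the lock word. delay_per_waiter is an assumed
 * tunable; picking its value is exactly the hardware-sensitive part.
 */
static void lock_backoff(struct ticket_lock *l, unsigned delay_per_waiter)
{
    unsigned me = atomic_fetch_add(&l->next, 1);

    for (;;) {
        unsigned cur = atomic_load_explicit(&l->owner, memory_order_acquire);
        unsigned loops;

        if (cur == me)
            return;

        /* Wait longer the further back in the queue we are. */
        loops = (me - cur) * delay_per_waiter;
        while (loops--)
            relax_hint();   /* does not look at the lock word */
    }
}

static void ticket_unlock(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

The inner "while (loops--)" loop is the part that no longer depends only
on the loaded value, so it has to be justified with measurements on each
target platform.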

And "notice busy loops and turn it into mwait" is actually a big deal
for power use of a CPU. Back-off with busy-looping timing waits can be
an absolutely *horrible* thing for power use. So we have bigger
issues than just performance, there's complex CPU power behavior too.
Being "smart" can often be really really hard.

I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself, but the patch I did see looked like
it might be absolutely horrible. How long does a "cpu_relax()" wait?
Do you know? How does "cpu_relax()" interface with the rest of the
CPU? Do you know? Because I've heard noises about cpu_relax() actually
affecting the memory pipe behavior of cache accesses of the CPU, and
thus the "cpu_relax()" in a busy loop that does *not* look at the
value (your "backoff delay loop") may actually work very very
differently from the cpu_relax() in the actual "wait for the value to
change" loop.
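
As a starting point for the "how long does it wait" question, a rough
user-space sketch like the following could time an isolated relax loop.
The helper names (relax_hint, relax_cost_ns) are hypothetical, POSIX
clock_gettime() and a GCC/Clang compiler are assumed, and the number it
produces says nothing about how the hint interacts with in-flight memory
accesses, which is the harder half of the question.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for cpu_relax(). */
static inline void relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#else
    /* Compiler barrier only, so the timing loop is not optimized away. */
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Average wall-clock cost in nanoseconds of one relax_hint() call. */
static double relax_cost_ns(unsigned long iters)
{
    struct timespec t0, t1;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iters; i++)
        relax_hint();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
         (double)(t1.tv_nsec - t0.tv_nsec);
    return iters ? ns / (double)iters : 0.0;
}
```

Running relax_cost_ns() with the same iteration count on different
microarchitectures is one cheap way to check whether a fixed backoff
count means the same delay everywhere.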

And how does that account for two different microarchitectures doing
totally different things? Maybe one uarch makes cpu_relax() just shut
down the front-end for a while, while another does something much
fancier and gives hints to in-flight memory accesses etc?

When do you start doing mwait vs just busy-looping with cpu_relax? How
do you tune it to do the right thing for different architectures?

So I think this is complex. At many different levels. And it's *all*
about the details. No handwaving about how "back-off is like an air
bag". Because the big picture is entirely and totally irrelevant, when
the details are almost all that actually matter.
