Re: [PATCH] lockref: remove cpu_relax() again

From: Waiman Long
Date: Thu Sep 05 2013 - 13:54:49 EST

Next message: Matthew Wilcox: "Re: RFC Block Layer Extensions to Support NV-DIMMs"
Previous message: Linus Torvalds: "Re: [PATCH] lockref: remove cpu_relax() again"
In reply to: Tony Luck: "Re: [PATCH] lockref: remove cpu_relax() again"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 09/05/2013 11:31 AM, Linus Torvalds wrote:

On Thu, Sep 5, 2013 at 6:18 AM, Heiko Carstens
<heiko.carstens@xxxxxxxxxx> wrote:

*If* however the cpu_relax() makes sense on other platforms maybe we could
add something like we have already with "arch_mutex_cpu_relax()":

I actually think it won't.

The lockref cmpxchg isn't waiting for something to change - it only
loops _if_ something has changed, and rather than cpu_relax(), we most
likely want to try to take advantage of the fact that we have the
changed data in our exclusive cacheline, and try to get our ref update
out as soon as possible.

IOW, the lockref loop is not an idle loop like a spinlock "wait for
lock to be released", it's very much an active loop of "oops,
something changed".

And there can't be any livelock, since by definition somebody else
_did_ make progress. In fact, adding the cpu_relax() probably just
makes things much less fair - once somebody else raced on you, the
cpu_relax() now makes it more likely that _another_ cpu does so too.

That said, let's see Tony's numbers are. On x86, it doesn't seem to
matter, but as Tony noticed, the variability can be quite high (for
me, the numbers tend to be quite stable when running the test program
multiple times in a loop, but then variation between boots or after
having done something else can be quite big - I suspect the cache
access patterns end up varying wildly with different dentry layout and
hash chain depth).

Linus

I have found that having a cpu_relax() at the bottom of the while
loop in CMPXCHG_LOOP() macro does help performance in some case on
x86-64 processors. I saw no noticeable difference for the AIM7's
short workload. On the high_systime workload, however, I saw about 5%
better performance with cpu_relax(). Below were the perf profile of
the 2 high_systime runs at 1500 users on a 80-core DL980 with HT off.

Without cpu_relax():

4.60% ls [kernel.kallsyms] [k] _raw_spin_lock
|--48.50%-- lockref_put_or_lock
|--46.97%-- lockref_get_or_lock
|--1.04%-- lockref_get

2.95% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--29.70%-- lockref_put_or_lock
|--28.87%-- lockref_get_or_lock
|--0.63%-- lockref_get

With cpu_relax():

1.67% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--14.80%-- lockref_put_or_lock
|--14.04%-- lockref_get_or_lock

1.36% ls [kernel.kallsyms] [k] _raw_spin_lock
|--44.94%-- lockref_put_or_lock
|--43.12%-- lockref_get_or_lock

So I would suggest having some kind of conditional cpu_relax() in
the loop.

-Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Matthew Wilcox: "Re: RFC Block Layer Extensions to Support NV-DIMMs"
Previous message: Linus Torvalds: "Re: [PATCH] lockref: remove cpu_relax() again"
In reply to: Tony Luck: "Re: [PATCH] lockref: remove cpu_relax() again"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]