Re: [PATCH V5 0/4][RFC] futex: FUTEX_LOCK with optional adaptive spinning

From: Darren Hart
Date: Thu Apr 15 2010 - 02:13:34 EST


dvhltc@xxxxxxxxxx wrote:

Now that an advantage can be shown using FUTEX_LOCK_ADAPTIVE over FUTEX_LOCK,
the next steps as I see them are:

o Try and show improvement of FUTEX_LOCK_ADAPTIVE over FUTEX_WAIT based
implementations (pthread_mutex specifically).

I've spent a bit of time on this, and made huge improvements through some simple optimizations of the testcase lock/unlock routines. I'll be away for a few days and wanted to let people know where things stand with FUTEX_LOCK_ADAPTIVE.
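
For reference, the lock/unlock routines in the testcase follow the usual fastpath/slowpath split: an atomic cmpxchg in userspace, with a futex() call only on contention. The sketch below is only meant to show that structure; the FUTEX_LOCK/FUTEX_LOCK_ADAPTIVE opcode values, the tid-in-the-futex convention, and the FUTEX_WAKE based unlock slowpath are assumptions for illustration, not the testcase verbatim:

#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>	/* FUTEX_WAKE */

/* Hypothetical opcode values; FUTEX_LOCK/FUTEX_LOCK_ADAPTIVE come from this
 * patch set and are not in mainline <linux/futex.h>. */
#define FUTEX_LOCK		0x85
#define FUTEX_LOCK_ADAPTIVE	0x86

static long futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Assumed value protocol: 0 means unlocked, the owner's tid means locked,
 * and the kernel flags the word when it queues waiters so the unlock
 * cmpxchg fails and we take the wake slowpath. */
static void futex_lock(atomic_int *f, int adaptive)
{
	int unlocked = 0;

	/* Userspace fastpath: uncontended acquire is a single cmpxchg. */
	if (atomic_compare_exchange_strong(f, &unlocked,
					   (int)syscall(SYS_gettid)))
		return;

	/* Slowpath: the kernel blocks us, or adaptively spins, until it
	 * can hand us the lock (assumed here). */
	futex(f, adaptive ? FUTEX_LOCK_ADAPTIVE : FUTEX_LOCK, 0);
}

static void futex_unlock(atomic_int *f)
{
	int me = (int)syscall(SYS_gettid);

	/* Fastpath: still the uncontended owner, just store 0. */
	if (atomic_compare_exchange_strong(f, &me, 0))
		return;

	/* Otherwise there may be waiters queued in the kernel: release the
	 * word and wake one (this is the futex_wake path that shows up in
	 * the perf data below). */
	atomic_store(f, 0);
	futex(f, FUTEX_WAKE, 1);
}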

I ran all the tests with the following options:
-i 1000000 -p 1000 -d 20
where:
-i iterations
-p period (in instructions)
-d duty cycle (in percent)
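
In other words, each iteration spans roughly one -p sized period of work with the lock held for -d percent of it. A minimal sketch of the per-thread loop, reusing the futex_lock()/futex_unlock() sketch above; do_work() is just a stand-in for the testcase's calibrated busy loop, not the real calibration code:

/* Stand-in busy loop; the real testcase calibrates work in instructions. */
static void do_work(long units)
{
	for (volatile long i = 0; i < units; i++)
		;
}

/* Per-thread measurement loop, simplified: -i iterations, each spanning a
 * -p sized period with the lock held for -d percent of it.  The pthread
 * variants just substitute pthread_mutex_lock()/unlock() here. */
static void locktest(atomic_int *lock, long iters, long period, long duty)
{
	long hold = period * duty / 100;

	for (long i = 0; i < iters; i++) {
		futex_lock(lock, 1);		/* 1 => FUTEX_LOCK_ADAPTIVE */
		do_work(hold);			/* critical section */
		futex_unlock(lock);
		do_work(period - hold);		/* rest of period, unlocked */
	}
}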

MECHANISM KITERS/SEC
----------------------------------
pthread_mutex_adaptive 1562
FUTEX_LOCK_ADAPTIVE 1190
pthread_mutex 1010
FUTEX_LOCK 532


I took some perf data while running each of the above tests as well. Any thoughts on getting more from perf are appreciated; this is my first pass at it. I recorded with "perf record -fg" and snippets of "perf report" follow:

FUTEX_LOCK (not adaptive) spends a lot of time spinning on the futex hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
40.76% futex_lock [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--62.16%-- do_futex
| sys_futex
| system_call_fastpath
| syscall
|
|--31.05%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| syscall
...
14.98% futex_lock futex_lock [.] locktest


FUTEX_LOCK_ADAPTIVE spends much of its time in the test loop itself, followed by the actual adaptive loop in the kernel. It appears much of our savings over FUTEX_LOCK comes from not contending on the hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
36.07% futex_lock futex_lock [.] locktest
|
--- locktest
|
--100.00%-- 0x400e7000000000

9.12% futex_lock perf [.] 0x00000000000eee
...
8.26% futex_lock [kernel.kallsyms] [k] futex_spin_on_owner


Pthread Mutex Adaptive spends most of its time in the glibc heuristic spinning, as expected, followed by the test loop itself. An impressively minimal 3.35% is spent on the hashbucket lock. (A rough paraphrase of the glibc heuristic follows the perf output below.)
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
47.88% pthread_mutex_2 libpthread-2.5.so [.] __pthread_mutex_lock_internal
|
--- __pthread_mutex_lock_internal

22.78% pthread_mutex_2 pthread_mutex_2 [.] locktest
...
15.16% pthread_mutex_2 perf [.] ...
...
3.35% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
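
For comparison, the glibc heuristic being measured here is roughly the following, paraphrased from memory of the PTHREAD_MUTEX_ADAPTIVE_NP path in nptl as standalone code; names and constants are approximate, not verbatim glibc. The point is that all of the spinning happens in userspace around a try-lock, and the spin budget adapts to how many spins past acquisitions needed before falling back to the normal FUTEX_WAIT slowpath:

#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define MAX_ADAPTIVE_SPINS 100		/* order of glibc's cap, approximate */

struct adaptive_mutex {
	atomic_int lock;	/* 0 free, 1 locked, 2 locked with waiters */
	int spins;		/* running estimate of a useful spin count */
};

static void adaptive_lock(struct adaptive_mutex *m)
{
	int unlocked = 0;

	if (atomic_compare_exchange_strong(&m->lock, &unlocked, 1))
		return;				/* uncontended fastpath */

	int cnt = 0;
	int max_cnt = m->spins * 2 + 10;
	if (max_cnt > MAX_ADAPTIVE_SPINS)
		max_cnt = MAX_ADAPTIVE_SPINS;

	for (;;) {
		unlocked = 0;
		if (atomic_compare_exchange_strong(&m->lock, &unlocked, 1))
			break;			/* got it while spinning */

		if (cnt++ >= max_cnt) {
			/* Spin budget exhausted: sleep in the kernel like a
			 * plain mutex would (unlock/wake side not shown). */
			while (atomic_exchange(&m->lock, 2) != 0)
				syscall(SYS_futex, &m->lock, FUTEX_WAIT, 2,
					NULL, NULL, 0);
			break;
		}
		/* the real code executes a pause/rep-nop here */
	}

	/* Nudge the spin budget toward what this acquisition needed. */
	m->spins += (cnt - m->spins) / 8;
}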


Pthread Mutex (not adaptive) spends much of its time on the hashbucket lock as expected, followed by the test loop.
33.89% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--56.90%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| __lll_unlock_wake
|
|--28.95%-- futex_wait_setup
| futex_wait
| do_futex
| sys_futex
| system_call_fastpath
| __lll_lock_wait
...
16.60% pthread_mutex_2 pthread_mutex_2 [.] locktest


These results mostly confirm what was expected: the adaptive versions spend more time in their spin loops and less time contending for hashbucket locks, while the non-adaptive versions take the hashbucket lock more often and therefore show more contention there.

I believe I should be able to get the plain FUTEX_LOCK implementation to be much closer in performance to the plain pthread mutex version. I expect much of the work done to benefit FUTEX_LOCK will also benefit FUTEX_LOCK_ADAPTIVE. If that's true, and I can make a significant improvement to FUTEX_LOCK, it wouldn't take much to get FUTEX_LOCK_ADAPTIVE to beat the adaptive spinning heuristic in glibc.

It could also be that this synthetic benchmark is an ideal situation for glibc's heuristics, and a more realistic load with varying lock hold times wouldn't favor the adaptive pthread mutex over FUTEX_LOCK_ADAPTIVE by such a large margin.

More next week.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team