[PATCH v6 0/5] futex: Wakeup optimizations

From: Davidlohr Bueso
Date: Sun Jan 12 2014 - 18:32:16 EST


Changes from v5 [https://lkml.org/lkml/2014/1/2/178]:
- Added latest review tags

- Removed redundant CONFIG_SMP ifdef in futex_get_mm()

- Updated comments per Darren's suggestions.

Changes from v3/v4 [http://lkml.org/lkml/2013/12/19/627]:
- Almost completely redid patch 4, based on suggestions
  by Linus. Instead of adding an atomic counter to keep
  track of the plist size, couple the list's head empty
  call with a check to see if the hb lock is locked.
  This solves the race that motivated the use of the new
  atomic field.

- Fix grammar in patch 3.

- Fix SOB tags.

Changes from v2 [http://lwn.net/Articles/575449/]:
- Reordered SOB tags to reflect me as primary author.

- Improved ordering guarantee comments for patch 4.

- Rebased patch 4 against Linus' tree (this patch didn't
  apply after the recent futex changes/fixes).

Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
- Removed patch "futex: Check for pi futex_q only once".

- Cleaned up ifdefs for larger hash table.

- Added a doc patch from tglx that describes the futex
  ordering guarantees.

- Improved the lockless plist check for the wake calls.
  Based on community feedback, the necessary abstractions
  and barriers were added to maintain ordering guarantees.
  Code documentation was also updated.

- Removed patch "sched,futex: Provide delayed wakeup list".
  Based on feedback from PeterZ, I will look into this as
  a separate issue once the other patches are settled.

We have been dealing with a customer database workload on a large
16-socket, 240-core NUMA system with 12TB of memory that exhibits
high amounts of contention on some of the locks that serialize
internal futex data structures. This workload especially suffers in
the wakeup paths, where waiting on the corresponding hb->lock can
account for up to ~60% of the time. The results of such calls can
mostly be classified as (i) nothing to wake up and (ii) waking up a
large number of tasks.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
             --- _raw_spin_lock
              |
              |--97.14%-- futex_wake
              |           do_futex
              |           sys_futex
              |           system_call_fastpath
              |           |
              |           |--99.70%-- 0x7f383fbdea1f
              |           |           yyy

 43.71%  762296  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
             --- _raw_spin_lock
              |
              |--53.74%-- futex_wake
              |           do_futex
              |           sys_futex
              |           system_call_fastpath
              |           |
              |           |--99.40%-- 0x7fe7d44a4c05
              |           |           zzz
              |--45.90%-- futex_wait_setup
              |           futex_wait
              |           do_futex
              |           sys_futex
              |           system_call_fastpath
              |           0x7fe7ba315789
              |           syscall


With these patches, contention is practically nonexistent:

  0.10%      49  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
              --- _raw_spin_lock
               |
               |--76.06%-- futex_wait_setup
               |           futex_wait
               |           do_futex
               |           sys_futex
               |           system_call_fastpath
               |           |
               |           |--99.90%-- 0x7f3165e63789
               |           |           syscall
               |           ...
               |--6.27%-- futex_wake
               |          do_futex
               |          sys_futex
               |          system_call_fastpath
               |          |
               |          |--54.56%-- 0x7f317fff2c05
               ...

Patch 1 is a cleanup.

Patch 2 addresses the well-known issue of the global hash table.
By creating a larger and NUMA-aware table, we reduce false sharing
and collisions, thus reducing the chance of different futexes hashing
to the same hb->lock.

Patch 3 documents the futex ordering guarantees.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly addresses point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls to end up returning 0, indicating that
no tasks were woken.

Patch 5 silences a gcc warning msg.

This patchset has also been tested on smaller systems with a variety of
benchmarks, including Java workloads, kernel builds and custom programs that
bang the hell out of the hb locks. So far, no functional or performance
regressions have been seen.
Furthermore, no issues were found when running the different tests in the futextest
suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc6 (9a0bb296)

Special thanks to Scott Norton, Tom Vanden, Mark Ray and Aswin Chandramouleeswaran
for help presenting, debugging and analyzing the data.

futex: Misc cleanups
futex: Larger hash table
futex: Document ordering guarantees
futex: Avoid taking hb lock if nothing to wakeup
futex: Silence uninitialized warnings

kernel/futex.c | 205 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 163 insertions(+), 42 deletions(-)

--
1.8.1.4
