Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"

From: Chris Mason
Date: Fri Jun 06 2025 - 17:06:47 EST


On 6/6/25 3:06 AM, Sebastian Andrzej Siewior wrote:
> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
>>>> We've got large systems that are basically dedicated to single
>>>> workloads, and those will probably miss the larger global hash table,
>>>> regressing like schbench did. Then we have large systems spread over
>>>> multiple big workloads that will love the private tables.
>>>>
>>>> In either case, I think growing the hash table as a multiple of thread
>>>> count instead of cpu count will probably better reflect the crazy things
>>>> multi-threaded applications do? At any rate, I don't think we want
>>>> applications to need prctl to get back to the performance they had on
>>>> older kernels.
>>>
>>> This is only an issue if all your CPUs spend their time in the kernel
>>> using the hash buckets at the same time.
>>> This was the case in every benchmark I've seen so far. Your thing might
>>> be closer to an actual workload.
>>>
>>
>> I didn't spend a ton of time looking at the perf profiles of the slower
>> kernel, was the bottleneck in the hash chain length or in contention for
>> the buckets?
>
> Every futex operation does a rcuref_get() (which is an atomic inc) on
> the private hash. This is before anything else happens. If you have two
> threads, on two CPUs, which simultaneously do a futex() operation then
> both do this rcuref_get(). That atomic inc ensures that the cacheline
> bounces from one CPU to the other. On the exit of the syscall there is a
> matching rcuref_put().
>
>>>> For people that want to avoid that memory overhead, I'm assuming they
>>>> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
>>>> make that more clear.
>>>>
>>>>> Then there's the possibility of
>>> …
>>>>> 256 cores, 2xNUMA:
>>>>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
>>>>> | average rps: 785 446.07 Futex HBs: 1024 immutable: 0
>>>>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
>>>>> | average rps: 736 769.77 Futex HBs: 2048 immutable: 0
>>>>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
>>>>
>>>>
>>>> How long are these runs? That's a huge benefit from being immutable
>>>> (1.5M vs 736K?) but the hash table churn should be confined to early in
>>>> the schbench run right?
>>>
>>> I think 30 secs or so. I used your command line.
>>
>> Ah ok, my command line is 60 seconds. It seems strange that the
>> immutable flag makes it that much faster, though. schbench
>> starts all the threads up front, so it should hit steady state pretty
>> quickly. More on NUMA below, but I'll benchmark with the immutable flag
>> on the turin box in the morning to see if it is the extra atomics.
>
> That immutable flag makes this rcuref_get()/rcuref_put() go away. The price is
> that you can't change the size of the private hash anymore. So if your
> workload works best with a hash size of X and you don't intend to change
> it during the runtime of the program, set the immutable flag.

ok, for the benchmarks, I hard coded the number of buckets at 1024 and
the only thing I flipped was the immutable flag. I was running on top
of c0c9379f235df33a12ceae94370ad80c5278324d (just today's Linus).

schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0

1024 bucket without immutable:

RPS 2 297 856

1024 with immutable:

RPS 3 665 920

This is pretty similar to what 6.15 obtains, so I'm hoping we can find a
way to get the 6.15 performance levels without the applications needing
to call prctl manually.

I took some profiles of the 1024-bucket, non-immutable run:

16.92% schbench [kernel.kallsyms] [k] futex_hash_put
|
|--8.93%--fpost
|
|--6.99%--xlist_wake_all
|
--0.58%--entry_SYSCALL_64
syscall

13.27% schbench libc.so.6 [.] syscall
|
|--8.90%--xlist_wake_all
|
--3.98%--entry_SYSCALL_64
syscall

12.36% schbench [kernel.kallsyms] [k] entry_SYSCALL_64
|
|--11.04%--__futex_hash
| futex_hash
| futex_wake
| do_futex
| __x64_sys_futex
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.23%--futex_hash
futex_wake
do_futex
__x64_sys_futex
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

Percent │ 0xffffffff813f2f80 <futex_hash_put>:
0.05 │ → callq __fentry__
0.06 │ pushq %rbx
0.08 │ movq 0x18(%rdi),%rbx
0.05 │ testq %rbx,%rbx
0.03 │ ↓ je 15
12.49 │ cmpb $0x0,0x21(%rbx)
12.58 │ ↓ je 17
│15: popq %rbx
│ ← retq
0.24 │17: movl $0xffffffff,%esi
12.76 │ lock
│ xaddl %esi,(%rbx)
7.80 │ ┌──subl $0x1,%esi
22.96 │ ├──js 27
23.05 │ │ popq %rbx
7.85 │ │← retq
│27:└─→movq %rbx,%rdi
│ → callq rcuref_put_slowpath
│ testb %al,%al
│ ↑ je 15
│ movq 0x18(%rbx),%rdi
│ popq %rbx
│ → jmp wake_up_var

If you like profiles with line numbers:
12830 samples (10.24%) Comms: schbench
futex_hash_put @ /root/linux-6.15/kernel/futex/core.c:179:2
futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:148:6 [inlined]
futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:153:6 [inlined]
rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:173:13 [inlined]
__rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:110:5 [inlined]
futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:200:1
do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
__x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
__se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
entry_SYSCALL_64_after_hwframe
13795 samples (11.01%) Comms: schbench
futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15
futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:143:5 [inlined]
futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2
class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined]
do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
__x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
__se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
entry_SYSCALL_64_after_hwframe
14364 samples (11.46%) Comms: schbench
entry_SYSCALL_64
17612 samples (14.05%) Comms: schbench
futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15
futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:145:9 [inlined]
rcuref_get @ /root/linux-6.15/./include/linux/rcuref.h:87:5 [inlined]
futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2
class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined]
do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
__x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
__se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
entry_SYSCALL_64_after_hwframe

With immutable, futexes are not in the top 10.

-chris