Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting

From: Shrikanth Hegde
Date: Wed Jul 16 2025 - 14:23:24 EST




On 7/16/25 19:59, Peter Zijlstra wrote:
On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:

I tried again: went back to baseline, removed BROKEN, and ran the command below, which gives us the immutable numbers.
./perf bench futex hash -Ib512
Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
Futex hashing: 512 hash buckets (immutable)

So, with the -b 512 option, throughput is around 8-10% lower than with the immutable hash.

Urgh, can you run perf on that and tell me if this is due to
this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
doing LWSYNC ?

It seems to be due to RCU and irq enable.
Both perf records were collected with -b512.


base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512 - baseline + series + ./perf bench futex hash -b512


perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  ..............................
#
    21.62%     -2.26%  [kernel.vmlinux]  [k] futex_get_value_locked
     0.16%     +2.01%  [kernel.vmlinux]  [k] __rcu_read_unlock
     1.35%     +1.63%  [kernel.vmlinux]  [k] arch_local_irq_restore.part.0
               +1.48%  [kernel.vmlinux]  [k] futex_private_hash_put
               +1.16%  [kernel.vmlinux]  [k] futex_ref_get
    10.41%     -0.78%  [kernel.vmlinux]  [k] system_call_vectored_common
     1.24%     +0.72%  perf              [.] workerfn
     5.32%     -0.66%  [kernel.vmlinux]  [k] futex_q_lock
     2.48%     -0.43%  [kernel.vmlinux]  [k] futex_wait
     2.47%     -0.40%  [kernel.vmlinux]  [k] _raw_spin_lock
     2.98%     -0.35%  [kernel.vmlinux]  [k] futex_q_unlock
     2.42%     -0.34%  [kernel.vmlinux]  [k] __futex_wait
     5.47%     -0.32%  libc.so.6         [.] syscall
     4.03%     -0.32%  [kernel.vmlinux]  [k] memcpy_power7
     0.16%     +0.22%  [kernel.vmlinux]  [k] arch_local_irq_restore
     5.93%     -0.18%  [kernel.vmlinux]  [k] futex_hash
     1.72%     -0.17%  [kernel.vmlinux]  [k] sys_futex



Anyway, I think we can improve both. Does the below help?


---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..8c41d050bd1f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
-	guard(rcu)();
+	guard(preempt)();
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_inc(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_inc(*mm->futex_ref);
 		return true;
 	}
@@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
 {
 	struct mm_struct *mm = fph->mm;
-	guard(rcu)();
+	guard(preempt)();
-	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
-		this_cpu_dec(*mm->futex_ref);
+	if (READ_ONCE(fph->state) == FR_PERCPU) {
+		__this_cpu_dec(*mm->futex_ref);
 		return false;
 	}
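
For anyone following along, here is a rough userspace model of the get/put fast path that the hunks above touch. It is only a sketch: the names (ref_model, REF_PERCPU, NR_SLOTS, ...) are made up for illustration, the "CPU" is passed in explicitly instead of relying on preemption being disabled, and the slow path (an atomic counter with inc-not-zero / dec-and-test semantics) is an assumption, since the quoted hunks do not show it. The relaxed load stands in for READ_ONCE() and the plain increment for __this_cpu_inc().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* Modelled loosely on FR_PERCPU; REF_ATOMIC is an assumed name. */
enum ref_state { REF_PERCPU, REF_ATOMIC };

struct ref_model {
	_Atomic int state;	/* which mode the counter is in */
	long slot[NR_SLOTS];	/* per-"CPU" counters, plain non-atomic */
	atomic_long fallback;	/* assumed slow-path counter */
};

/* Take a reference; returns false only if the slow-path count is 0. */
static bool ref_model_get(struct ref_model *r, unsigned int cpu)
{
	/* Relaxed load: plays the role of READ_ONCE() in the patch. */
	if (atomic_load_explicit(&r->state, memory_order_relaxed) == REF_PERCPU) {
		/* Plain increment: plays the role of __this_cpu_inc(). */
		r->slot[cpu % NR_SLOTS]++;
		return true;
	}

	/* Assumed slow path: increment unless the count already hit 0. */
	long old = atomic_load(&r->fallback);
	while (old != 0) {
		if (atomic_compare_exchange_weak(&r->fallback, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true only when the last one is dropped. */
static bool ref_model_put(struct ref_model *r, unsigned int cpu)
{
	if (atomic_load_explicit(&r->state, memory_order_relaxed) == REF_PERCPU) {
		r->slot[cpu % NR_SLOTS]--;
		return false;	/* per-CPU mode never reports "last ref" */
	}

	/* Assumed slow path: dec-and-test on the atomic counter. */
	return atomic_fetch_sub(&r->fallback, 1) == 1;
}

/* Sum of all per-"CPU" slots; individual slots may go negative. */
static long ref_model_sum(const struct ref_model *r)
{
	long sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += r->slot[i];
	return sum;
}
```

The model is single-threaded and says nothing about how the switch between the two modes is synchronised; it only shows why the common case can get away with a relaxed load and a non-atomic increment.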

Yes, it helps. It improves the "-b 512" numbers by about 5%.

baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets


baseline + series + above_patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10 <<< 5% improvement
Futex hashing: 512 hash buckets


Now we are within 4-5% of baseline/immutable.
baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash
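
The percentages quoted in this mail can be rechecked from the four ops/sec figures above. A minimal sketch in C; the constant and function names are my own, only the numbers come from the perf bench runs:

```c
#include <stdio.h>

/* ops/sec figures copied from the perf bench runs in this mail */
static const double BASE_GLOBAL    = 1559643.0;	/* baseline, global hash */
static const double BASE_IMMUTABLE = 1536035.0;	/* baseline, -Ib512 */
static const double SERIES_B512    = 1412543.0;	/* baseline + series, -b512 */
static const double SERIES_PATCHED = 1482733.0;	/* + above_patch, -b512 */

/* Relative change of value with respect to ref, in percent. */
static double pct_delta(double value, double ref)
{
	return (value - ref) / ref * 100.0;
}

/* Print every comparison made in the text. */
static void print_deltas(FILE *out)
{
	fprintf(out, "series  vs immutable: %+.2f%%\n",
		pct_delta(SERIES_B512, BASE_IMMUTABLE));
	fprintf(out, "series  vs global:    %+.2f%%\n",
		pct_delta(SERIES_B512, BASE_GLOBAL));
	fprintf(out, "patched vs series:    %+.2f%%\n",
		pct_delta(SERIES_PATCHED, SERIES_B512));
	fprintf(out, "patched vs immutable: %+.2f%%\n",
		pct_delta(SERIES_PATCHED, BASE_IMMUTABLE));
	fprintf(out, "patched vs global:    %+.2f%%\n",
		pct_delta(SERIES_PATCHED, BASE_GLOBAL));
}
```

Calling print_deltas(stdout) from a main() gives roughly -8.0% and -9.4% for the series alone, about +5.0% for the patch on top of the series, and about -3.5% and -4.9% for the patched series against immutable and global hash respectively.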