Re: [PATCH v2 2/6] futex: Use RCU-based per-CPU reference counting instead of rcuref_t

From: André Draszik
Date: Wed Jul 30 2025 - 08:23:20 EST


On Thu, 2025-07-10 at 13:00 +0200, Sebastian Andrzej Siewior wrote:
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> The use of rcuref_t for reference counting introduces a performance bottleneck
> when accessed concurrently by multiple threads during futex operations.
>
> Replace rcuref_t with specially crafted per-CPU reference counters. The
> lifetime logic remains the same.
>
> The newly allocated private hash starts in FR_PERCPU state. In this state, each
> futex operation that requires the private hash uses a per-CPU counter (an
> unsigned int) for incrementing or decrementing the reference count.
>
> When the private hash is about to be replaced, the per-CPU counters are
> migrated to an atomic_t counter, mm_struct::futex_atomic.
> The migration process:
> - Waiting for one RCU grace period to ensure all users observe the
>   current private hash. This can be skipped if a grace period elapsed
>   since the private hash was assigned.
>
> - futex_private_hash::state is set to FR_ATOMIC, forcing all users to
>   use mm_struct::futex_atomic for reference counting.
>
> - After an RCU grace period, all users are guaranteed to be using the
>   atomic counter. The per-CPU counters can now be summed up and added to
>   the atomic_t counter. If the resulting count is zero, the hash can be
>   safely replaced. Otherwise, active users still hold a valid reference.
>
> - Once the atomic reference count drops to zero, the next futex
>   operation will switch to the new private hash.
>
> call_rcu_hurry() is used to speed up the transition, which might otherwise be
> delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
> side effects would be that on auto scaling the new hash is used later
> and the SET_SLOTS prctl() will block longer.
>
> [bigeasy: commit description + mm get/put_async]
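
For reference, the migration steps quoted above can be modelled roughly as
follows. This is a single-threaded user-space sketch, not the kernel code:
struct hash_ref_model, NR_CPUS_MODEL, the folded flag and the ref_*() helpers
are hypothetical names made up for illustration, RCU grace periods are not
modelled at all, and any biasing of the atomic counter that the real
implementation may need between the state switch and the summation is left
out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4 /* hypothetical CPU count, for the model only */

enum ref_state { FR_PERCPU, FR_ATOMIC };

/* Hypothetical stand-in for the private hash reference; not a kernel type. */
struct hash_ref_model {
	enum ref_state state;
	bool folded;                        /* per-CPU counts already summed? */
	unsigned int percpu[NR_CPUS_MODEL]; /* models the per-CPU counters */
	atomic_int atomic;                  /* models mm_struct::futex_atomic */
};

static void ref_init(struct hash_ref_model *r)
{
	r->state = FR_PERCPU;
	r->folded = false;
	for (int i = 0; i < NR_CPUS_MODEL; i++)
		r->percpu[i] = 0;
	atomic_init(&r->atomic, 0);
}

/* Take a reference on behalf of the given (modelled) CPU. */
static void ref_get(struct hash_ref_model *r, unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	if (r->state == FR_PERCPU)
		r->percpu[cpu]++;
	else
		atomic_fetch_add(&r->atomic, 1);
}

/* Drop a reference; true only if it was the last one after folding. */
static bool ref_put(struct hash_ref_model *r, unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	if (r->state == FR_PERCPU) {
		r->percpu[cpu]--; /* may wrap; only the sum over CPUs matters */
		return false;
	}
	return atomic_fetch_sub(&r->atomic, 1) == 1 && r->folded;
}

/* Step 2 of the quoted list: force all users onto the atomic counter. */
static void ref_switch_to_atomic(struct hash_ref_model *r)
{
	r->state = FR_ATOMIC;
}

/* Step 3: sum the per-CPU counters into the atomic; true if now zero. */
static bool ref_fold(struct hash_ref_model *r)
{
	unsigned int sum = 0;

	for (int i = 0; i < NR_CPUS_MODEL; i++) {
		sum += r->percpu[i];
		r->percpu[i] = 0;
	}
	r->folded = true;
	return atomic_fetch_add(&r->atomic, (int)sum) + (int)sum == 0;
}
```

The point of the model is that a get and its matching put may land on
different CPUs, so an individual per-CPU counter can wrap below zero and only
the sum is meaningful. That is why the counters can be trusted only once
every user has switched to the atomic counter, which in the real code is what
the grace period guarantees.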

kmemleak complains about a new memleak with this commit:

[ 680.179004][ T101] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

$ cat /sys/kernel/debug/kmemleak
unreferenced object (percpu) 0xc22ec0eface8 (size 4):
  comm "swapper/0", pid 1, jiffies 4294893115
  hex dump (first 4 bytes on cpu 7):
    01 00 00 00 ....
  backtrace (crc b8bc6765):
    kmemleak_alloc_percpu+0x48/0xb8
    pcpu_alloc_noprof+0x6ac/0xb68
    futex_mm_init+0x60/0xe0
    mm_init+0x1e8/0x3c0
    mm_alloc+0x5c/0x78
    init_args+0x74/0x4b0
    debug_vm_pgtable+0x60/0x2d8
    do_one_initcall+0x128/0x3e0
    do_initcall_level+0xb4/0xe8
    do_initcalls+0x60/0xb0
    do_basic_setup+0x28/0x40
    kernel_init_freeable+0x158/0x1f8
    kernel_init+0x2c/0x1e0
    ret_from_fork+0x10/0x20

And futex_mm_init+0x60/0xe0 resolves to

	mm->futex_ref = alloc_percpu(unsigned int);

in futex_mm_init().

Reverting this commit (and patches 3 and 4 in this series, due to context)
makes kmemleak happy again.

Cheers,
Andre'