Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"

From: Peter Zijlstra
Date: Thu Jun 26 2025 - 09:18:13 EST

Next message: Neeraj Kumar: "Re: [RFC PATCH 05/20] nvdimm/region_label: Add region label updation routine"
Previous message: Vegard Nossum: "Re: [PATCHv7 00/16] x86: Enable Linear Address Space Separation support"
In reply to: Chris Mason: "Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash""
Next in thread: Sebastian Andrzej Siewior: "Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash""

On Thu, Jun 26, 2025 at 07:01:23AM -0400, Chris Mason wrote:
> On 6/24/25 3:01 PM, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
> >> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
> >>>>> We've got large systems that are basically dedicated to single
> >>>>> workloads, and those will probably miss the larger global hash table,
> >>>>> regressing like schbench did. Then we have large systems spread over
> >>>>> multiple big workloads that will love the private tables.
> >>>>>
> >>>>> In either case, I think growing the hash table as a multiple of thread
> >>>>> count instead of cpu count will probably better reflect the crazy things
> >>>>> multi-threaded applications do? At any rate, I don't think we want
> >>>>> applications to need prctl to get back to the performance they had on
> >>>>> older kernels.
> >>>>
> >>>> This is only an issue if all your CPUs spend their time in the kernel
> >>>> using the hash buckets at the same time.
> >>>> This was the case in every benchmark I've seen so far. Your thing might
> >>>> be closer to an actual workload.
> >>>>
> >>>
> >>> I didn't spend a ton of time looking at the perf profiles of the slower
> >>> kernel, was the bottleneck in the hash chain length or in contention for
> >>> the buckets?
> >>
> >> Every futex operation does a rcuref_get() (which is an atomic inc) on
> >> the private hash. This is before anything else happens. If you have two
> >> threads, on two CPUs, which simultaneously do a futex() operation then
> >> both do this rcuref_get(). That atomic inc ensures that the cacheline
> >> bounces from one CPU to the other. On the exit of the syscall there is a
> >> matching rcuref_put().
> >
> > How about something like this (very lightly tested)...
> >
> > The TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (i.e. the normal case), and only
> > folds the lot into an atomic when we really care about it.
> >
> > There are still some sharp corners, but it boots and survives the
> > (slightly modified) selftest.
>
> I can get some benchmarks going of this, thanks. For 6.16, is the goal
> to put something like this in, or default to the global hash table until
> we've nailed it down?
>
> I'd vote for defaulting to global for one more release.

Probably best to do that; means we don't have to rush crazy code :-)
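
For readers following along, the idea quoted above (per-cpu counting in
the normal case, folded into a single atomic only when a hash replacement
is pending) can be illustrated with a small userspace sketch. This is not
the patch under discussion; every name here (hybrid_ref, hybrid_get,
hybrid_put, hybrid_fold, NSLOTS) is hypothetical, the "per-cpu" slots are
just an array indexed by the caller, and the sketch deliberately omits
whatever mechanism a real implementation would need to wait for in-flight
fast-path users while the mode switches.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* One counter per slot, each on its own cache line to avoid bouncing. */
struct slot {
	_Alignas(64) atomic_long count;
};

struct hybrid_ref {
	struct slot percpu[NSLOTS];	/* fast path: slot-local counters */
	atomic_long shared;		/* slow path: one shared counter */
	atomic_bool use_shared;		/* set once a replacement is pending */
};

static void hybrid_init(struct hybrid_ref *r)
{
	for (int i = 0; i < NSLOTS; i++)
		atomic_init(&r->percpu[i].count, 0);
	atomic_init(&r->shared, 0);
	atomic_init(&r->use_shared, false);
}

/* Take a reference; in the normal case this only touches the local slot. */
static void hybrid_get(struct hybrid_ref *r, int slot)
{
	if (!atomic_load_explicit(&r->use_shared, memory_order_acquire))
		atomic_fetch_add_explicit(&r->percpu[slot].count, 1,
					  memory_order_relaxed);
	else
		atomic_fetch_add(&r->shared, 1);
}

/*
 * Drop a reference. Returns true only in shared mode, when this put
 * dropped the shared counter to zero. A put may land on a different
 * slot than the matching get, so a single slot can go negative; only
 * the sum over all slots is meaningful.
 */
static bool hybrid_put(struct hybrid_ref *r, int slot)
{
	if (!atomic_load_explicit(&r->use_shared, memory_order_acquire)) {
		atomic_fetch_sub_explicit(&r->percpu[slot].count, 1,
					  memory_order_relaxed);
		return false;
	}
	return atomic_fetch_sub(&r->shared, 1) == 1;
}

/*
 * Switch to shared mode and fold all slot counters into the shared
 * counter. Returns the shared count after folding. ASSUMPTION: the
 * caller guarantees no get/put is running on the fast path while
 * this executes; the sketch does not enforce that.
 */
static long hybrid_fold(struct hybrid_ref *r)
{
	long sum = 0;

	atomic_store_explicit(&r->use_shared, true, memory_order_release);
	for (int i = 0; i < NSLOTS; i++)
		sum += atomic_exchange(&r->percpu[i].count, 0);
	return atomic_fetch_add(&r->shared, sum) + sum;
}
```

With a single shared counter, two threads on two CPUs doing get/put
back to back each write the same cache line, which is the bouncing
described in the quoted text. In the sketch, the fast path writes only
the caller's own slot, and the shared line is read but not written
until hybrid_fold() flips use_shared.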
