'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]

From: Peter Zijlstra
Date: Tue Mar 03 2020 - 07:01:32 EST


Hi All,

Added some people harvested from glibc.git and added libc-alpha.

We currently have 2 big new futex features proposed, and still have the
whole NUMA thing on the table.

The proposed features are:

- a vectored FUTEX_WAIT (as per the parent thread); allows userspace to
  wait on up to 128 futex values.

- multi-size (8,16,32) futexes (WAIT,WAKE,CMP_REQUEUE).

Both these features are specific to the 'simple' futex interfaces, that
is, they exclude all the PI / robust stuff.

As is, the vectored WAIT doesn't interact nicely with the multi-size
proposal (or for that matter with the already existing PRIVATE flag),
because it does not allow flags to be specified per WAIT instance, but
this should be fixable with some small changes to the proposed ABI.

The much bigger sticking point, as already noticed by the multi-size
patches, is that the current ABI is a limiting factor: the giant
horrible syscall.

Now, we have a whole bunch of futex ops that are already gone (FD) or
are fundamentally broken (REQUEUE) or partially weird (WAIT_BITSET has
CLOCK selection where WAIT does not) or unused (per glibc, WAKE_OP,
WAKE_BITSET, WAIT_BITSET (except for that CLOCK crud)).

So how about we introduce new syscalls:

sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);

struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
};
sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
		unsigned long flags, ktime_t *timo);

sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);

sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
		      unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);

Where flags:

- has 2 bits for size: 8,16,32,64
- has 2 more bits for size (requeue) ??
- has ... bits for clocks
- has private/shared
- has numa
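
To make the flags word a bit more concrete, here is one way it could be
laid out. This is only a sketch: the FUTEX2_* names, the helper functions
and every bit position are made up for illustration, and nothing in the
list above fixes them. Bits 2-3 are left free for the possible second
(requeue) size, and the clock bits are left out since their number is
still open.

```c
#include <stdint.h>

/* Hypothetical layout; none of these names or positions are settled. */
#define FUTEX2_SIZE_MASK	0x3UL		/* 2 bits: log2(size in bytes) */
#define FUTEX2_SIZE_U8		0x0UL
#define FUTEX2_SIZE_U16		0x1UL
#define FUTEX2_SIZE_U32		0x2UL
#define FUTEX2_SIZE_U64		0x3UL
/* bits 2-3: left free for a second (requeue) size, if that is wanted */
#define FUTEX2_PRIVATE		(1UL << 4)	/* private vs. shared */
#define FUTEX2_NUMA		(1UL << 5)	/* node_id lives at uaddr-4 */

/* Width in bytes of the futex word selected by @flags. */
static inline unsigned int futex2_size(unsigned long flags)
{
	return 1U << (flags & FUTEX2_SIZE_MASK);
}

/* Mask of the bits of 'val' that are significant for this futex size. */
static inline uint64_t futex2_val_mask(unsigned long flags)
{
	unsigned int bits = 8 * futex2_size(flags);

	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}
```

Encoding the size as log2(bytes) is just one option; it keeps the decode
to a single shift and makes the all-zero flags word mean an 8-bit futex.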


This does not provide BITSET functionality, as I found no use in glibc.
Both wait and wake have arguments left; do we need this?

For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
node_id', with the following semantics:

- on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
directly used to index into per-node hash-tables. When -1, it is
replaced by the current node_id and an smp_mb() is issued before we
load and compare the @uaddr.

- on WAKE/REQUEUE, it is an immediate index.

Any invalid value will result in EINVAL.
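
The node_id rules above could be modelled in userspace roughly as follows.
This is a sketch under several assumptions, not kernel code: struct
numa_futex, numa_futex_node() and its nr_nodes/cur_node/is_wait parameters
are hypothetical, C11 atomics stand in for the kernel primitives (a
seq_cst fence in place of smp_mb()), and -1 is replaced with a
compare-and-exchange so that racing waiters agree on one node. It also
treats node_id == nr_nodes as invalid, assuming the per-node tables are
indexed 0..nr_nodes-1, whereas the text above writes the upper bound as
inclusive.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical userspace view: node_id sits directly below the futex word. */
struct numa_futex {
	_Atomic int node_id;		/* at uaddr - 4 on common ABIs */
	_Atomic unsigned int val;	/* the futex word at uaddr */
};

/*
 * Return the hash-table index to use for @f, or -EINVAL.
 * @nr_nodes and @cur_node stand in for kernel state; @is_wait selects
 * WAIT semantics (node_id may be -1) over WAKE/REQUEUE (immediate index).
 */
static int numa_futex_node(struct numa_futex *f, int nr_nodes,
			   int cur_node, int is_wait)
{
	int node = atomic_load(&f->node_id);

	if (node == -1) {
		if (!is_wait)
			return -EINVAL;
		/*
		 * Try to install the current node. If another waiter won the
		 * race, 'node' is updated to the value it installed.
		 */
		if (atomic_compare_exchange_strong(&f->node_id, &node, cur_node))
			node = cur_node;
		/* Full barrier before the caller loads and compares f->val. */
		atomic_thread_fence(memory_order_seq_cst);
	}

	if (node < 0 || node >= nr_nodes)
		return -EINVAL;
	return node;
}
```

A waiter would call this first and only then load and compare the futex
word; a waker that still sees -1 gets EINVAL, which matches the
"immediate index" rule for WAKE/REQUEUE.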


Then later, we can look at doing sys_futex_{,un}lock_{,pi}(), which have
all the mind-meld associated with robust and PI and possibly optimistic
spinning etc.

Opinions?