Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

From: Sebastian Andrzej Siewior
Date: Tue Sep 12 2023 - 07:26:32 EST


On 2023-09-12 13:17:16 [+0200], To Peter Zijlstra wrote:
> Comments follow…
>
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index f8e65b27d9d6b..d8866278e92ff 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
> /*
> * Caller must hold a reference on @pi_state.
> */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> + struct futex_pi_state *pi_state,
> + struct rt_mutex_waiter *top_waiter)
> {
> - struct rt_mutex_waiter *top_waiter;
> struct task_struct *new_owner;
> bool postunlock = false;
> DEFINE_RT_WAKE_Q(wqh);
> u32 curval, newval;
> int ret = 0;
>
> - top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> - if (WARN_ON_ONCE(!top_waiter)) {
> - /*
> - * As per the comment in futex_unlock_pi() this should not happen.
> - *
> - * When this happens, give up our locks and try again, giving
> - * the futex_lock_pi() instance time to complete, either by
> - * waiting on the rtmutex or removing itself from the futex
> - * queue.
> - */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> -
> new_owner = top_waiter->task;
>
> /*
> @@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> * first acquire the hb->lock before removing the lock from the
> * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> * lists consistent.
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> + * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of seeing a
> + * waiter that is leaving.
> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
> @@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> * Waiter is unqueued.
> */
> rt_mutex_post_schedule();

So no new lock-task can enter but one can give up and vanish.

> +
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock, which will enqueue a new rt_waiter.
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> + * waiters even though futex thinkgs there are, then the waiter
> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter)
> + goto do_uncontended;

So I spent some time to come up with a test case for this and then
hacked the kernel to get into exact this situation. The wait_lock from
above needs to be released since it is done by wake_futex_pi()
otherwise.

Then we do earlier get_pi_state(pi_state) and I think a put here is in
order right?

We own the hb lock here but it is released later in the do_uncontended
label so that looks okay.

The userland state is modified under the hb-lock so that should work.

> +
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck

So I *think* this could work. Then we need something similar for
requeue-pi.

Sebastian