Re: [PATCH] [RESEND 2] Take over futex of dead task only if FUTEX_WAITERS is not set

From: Darren Hart
Date: Thu Oct 25 2012 - 00:34:35 EST




On 10/23/2012 01:29 PM, Thomas Gleixner wrote:
> Darren, Siddhesh,
>
> On Tue, 23 Oct 2012, Darren Hart wrote:
>
>> Hi Siddhesh,
>>
>> Thanks for the patch and your work to isolate it in the glibc bug 14076.
>>
>> On 10/21/2012 08:20 PM, Siddhesh Poyarekar wrote:
>>> In futex_lock_pi_atomic, we consider that if the value in the futex
>>> variable is 0 with additional flags, then it is safe for takeover
>>> since the owner of the futex is dead. However, when FUTEX_WAITERS is
>>> set in the futex value, handle_futex_death calls futex_wake to wake up
>>> one task.
>>
>> It shouldn't for PI mutexes. It should just set the FUTEX_OWNER_DIED flag,
>> maintaining the FUTEX_WAITERS flag, and exit.
>>
>> int handle_futex_death(...
>> ...
>> /*
>> * Wake robust non-PI futexes here. The wakeup of
>> * PI futexes happens in exit_pi_state():
>> */
>> if (!pi && (uval & FUTEX_WAITERS))
>> futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
>
> Yes, the description of the problem is slightly wrong, but it still
> pinpoints the real wreckage.
>
>>> Hence the assumption in futex_lock_pi_atomic is not correct.
>>> The correct assumption is that a futex may be considered safe for a
>>> takeover if the FUTEX_OWNER_DIED bit is set, the TID bits are 0 and
>>> the FUTEX_WAITERS bit is not set.
> ...
>>> - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
>>> + if (unlikely(ownerdied ||
>>> + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
>
> This solves the problem at hand, but I'm not too happy with the
> solution. One of the real possible scenarios which expose the problem
> is:
>
> Futex F is initialized with PTHREAD_PRIO_INHERIT and
> PTHREAD_MUTEX_ROBUST_NP attributes.
>
> T1 lock_futex_pi(F);
>
> T2 lock_futex_pi(F);
>
> --> T2 blocks on the futex and creates pi_state which is associated
> to T1.
>
> T1 exits
>
> --> exit_robust_list() runs
>
> --> Futex F userspace value TID field is set to 0 and
> FUTEX_OWNER_DIED bit is set.
>
> T3 lock_futex_pi(F);
>
> --> Succeeds due to the check for F's userspace TID field == 0
>
> --> Claims ownership of the futex and sets its own TID into the
> userspace TID field of futex F
>
> --> returns to user space
>
> T1 --> exit_pi_state_list()
>
> --> Transfers pi_state to waiter T2 and wakes T2 via
> rt_mutex_unlock(&pi_state->mutex)
>
> T2 --> acquires pi_state->mutex and gains real ownership of the
> pi_state
>
> --> Claims ownership of the futex and sets its own TID into the
> userspace TID field of futex F
>
> --> returns to user space
>
> T3 --> observes inconsistent state
>
> This problem is independent of UP/SMP, preemptible/non-preemptible
> kernels, or process shared vs. private. The only difference is that
> certain configurations are more likely to expose it.
>
> So, as Siddhesh correctly analyzed, the following check in
> futex_lock_pi_atomic() is the culprit:
>
> if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
>
> We check the userspace value for a TID value of 0 and take over the
> futex unconditionally if that's true.
>
> AFAICT this check is there as it is correct for a different corner
> case of futexes: the WAITERS bit became stale.
>
> Now the proposed change
>
> - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
> + if (unlikely(ownerdied ||
> + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
>
> solves the problem, but it's not obvious why and it wrecks the
> "stale WAITERS bit" case.


In what scenario does the WAITERS bit become stale for pi futexes? This
corner case seems rather core to your solution, so I would like to
understand it a bit better.
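
For reference, here is a standalone userspace sketch (not kernel code) of
how the two conditions evaluate for the futex values discussed above. The
bit values match include/uapi/linux/futex.h; the helper names
takeover_current() and takeover_proposed() are hypothetical, and ownerdied
is passed as false to model the first pass through futex_lock_pi_atomic().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Futex word layout, values as in include/uapi/linux/futex.h. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* The two flag bits and the TID field together cover the whole 32-bit word. */
static_assert((FUTEX_WAITERS | FUTEX_OWNER_DIED | FUTEX_TID_MASK) == 0xffffffffu,
              "flags and TID field must cover the futex word");

/* Existing condition: take over whenever the TID field is 0. */
static bool takeover_current(uint32_t curval, bool ownerdied)
{
    return ownerdied || !(curval & FUTEX_TID_MASK);
}

/* Condition from Siddhesh's patch: WAITERS must be clear as well. */
static bool takeover_proposed(uint32_t curval, bool ownerdied)
{
    return ownerdied || !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS));
}
```

With curval == FUTEX_OWNER_DIED | FUTEX_WAITERS (T1 dead, T2 still blocked)
the existing condition takes over and the proposed one does not, which is
the fix. With curval == FUTEX_WAITERS alone (the stale WAITERS case) the
proposed condition also refuses the takeover, which is the regression
Thomas points out.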


>
> What happens is that, because the WAITERS bit is set (T2 is blocked
> on that futex), T3 is forced to go through lookup_pi_state(), which
> in the above case returns an existing pi_state and therefore forces T3
> to legitimately fight with T2 over the ownership of the pi_state (via
> pi_state->mutex). Problem solved!
>
> Though that does not work for the "WAITERS bit is stale" problem
> because if lookup_pi_state() does not find existing pi_state it
> returns -ESRCH (due to TID == 0) which causes futex_lock_pi() to
> return -ESRCH to user space because the OWNER_DIED bit is not set.
>
> Now there is a different solution to that problem. Do not look at the
> user space value at all and enforce a lookup of possibly available
> pi_state. If pi_state can be found, then the new incoming locker T3
> blocks on that pi_state and legitimately races with T2 to acquire the
> rt_mutex and the pi_state and therefore the proper ownership of the
> user space futex.


My first concern here is the performance impact of forcing the pi_state
lookup. However, if we got this far, we already took the syscall and
our performance sucks anyway. Correctness obviously trumps performance here.
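
As a rough model of the ordering proposed above (kernel state first, user
space TID second), the contended path could be sketched as below. This is
a simplified userspace illustration under my own assumptions, not the
kernel implementation: enum lock_outcome and contended_lock_outcome() are
hypothetical names, and the exiting-owner and fault paths are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffffu  /* as in include/uapi/linux/futex.h */

/* Hypothetical outcomes; the kernel uses return codes and a retry loop. */
enum lock_outcome {
    BLOCK_ON_PI_STATE,  /* contend on the pi_state's rt_mutex */
    FORCE_TAKE,         /* no kernel state and TID == 0: take the futex */
    FAIL_ESRCH          /* no kernel state and the TID names no live task */
};

/*
 * Simplified decision order:
 *  1) an existing pi_state always wins, whatever the user space value says;
 *  2) otherwise a live owner gets a fresh pi_state to block on;
 *  3) otherwise TID == 0 (owner died or stale WAITERS) forces the takeover;
 *  4) anything else is reported as -ESRCH.
 */
enum lock_outcome contended_lock_outcome(bool pi_state_exists,
                                         bool owner_task_found,
                                         uint32_t uval)
{
    if (pi_state_exists)
        return BLOCK_ON_PI_STATE;
    if ((uval & FUTEX_TID_MASK) && owner_task_found)
        return BLOCK_ON_PI_STATE;
    if (!(uval & FUTEX_TID_MASK))
        return FORCE_TAKE;
    return FAIL_ESRCH;
}
```

In this model T3 in the scenario above lands in case 1 as long as T2's
pi_state exists, so the user space value is only consulted once the kernel
has no state left for the futex.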


>
> lookup_pi_state() has the correct order of checks. It first tries to
> find a pi_state associated with the user space futex and only if that
> fails it checks for futex TID value = 0. If no pi_state is available
> nothing can create new state at that point because this happens with
> the hash bucket lock held.
>
> So the above scenario changes to:
>
> T1 lock_futex_pi(F);
>
> T2 lock_futex_pi(F);
>
> --> T2 blocks on the futex and creates pi_state which is associated
> to T1.
>
> T1 exits
>
> --> exit_robust_list() runs
>
> --> Futex F userspace value TID field is set to 0 and
> FUTEX_OWNER_DIED bit is set.
>
> T3 lock_futex_pi(F);
>
> --> Finds pi_state and blocks on pi_state->rt_mutex
>
> T1 --> exit_pi_state_list()
>
> --> Transfers pi_state to waiter T2 and wakes it via
> rt_mutex_unlock(&pi_state->mutex)
>
> T2 --> acquires pi_state->mutex and gains ownership of the pi_state
>
> --> Claims ownership of the futex and sets its own TID into the
> userspace TID field of futex F
>
> --> returns to user space
>
> This covers all gazillion points on which T3 might come in between
> T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
> also solves the "WAITERS bit stale" problem by forcing the take over.
>
> Another benefit of changing the code this way is that it makes it less
> dependent on untrusted user space values and therefore minimizes the
> possible wreckage which might be inflicted.


That's a definite plus!


> As usual after staring for too long at the futex code my brain hurts
> so much that I really want to ditch that whole optimization of
> avoiding the syscall for the non contended case for PI futexes and rip
> out the maze of corner case handling code. Unfortunately we can't as
> user space relies on that existing behaviour, but at least thinking
> about it helps me to preserve my mental sanity. Maybe we should
> nevertheless :)


I was surprised at how fast you were able to page all this in after all
that travel - or is this what you did for 12 hours on the plane?

:-)

--
Darren


> Thanks,
>
> tglx
>
> -------------->
>
> Index: linux/kernel/futex.c
> ===================================================================
> --- linux.orig/kernel/futex.c
> +++ linux/kernel/futex.c
> @@ -716,7 +716,7 @@ static int futex_lock_pi_atomic(u32 __us
> struct futex_pi_state **ps,
> struct task_struct *task, int set_waiters)
> {
> - int lock_taken, ret, ownerdied = 0;
> + int lock_taken, ret, force_take = 0;
> u32 uval, newval, curval, vpid = task_pid_vnr(task);
>
> retry:
> @@ -755,17 +755,15 @@ retry:
> newval = curval | FUTEX_WAITERS;
>
> /*
> - * There are two cases, where a futex might have no owner (the
> - * owner TID is 0): OWNER_DIED. We take over the futex in this
> - * case. We also do an unconditional take over, when the owner
> - * of the futex died.
> - *
> - * This is safe as we are protected by the hash bucket lock !
> + * Should we force take the futex? See below.
> */
> - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
> - /* Keep the OWNER_DIED bit */
> + if (unlikely(force_take)) {
> + /*
> + * Keep the OWNER_DIED and the WAITERS bit and set the
> + * new TID value.
> + */
> newval = (curval & ~FUTEX_TID_MASK) | vpid;
> - ownerdied = 0;
> + force_take = 0;
> lock_taken = 1;
> }
>
> @@ -775,7 +773,7 @@ retry:
> goto retry;
>
> /*
> - * We took the lock due to owner died take over.
> + * We took the lock due to forced take over.
> */
> if (unlikely(lock_taken))
> return 1;
> @@ -790,20 +788,25 @@ retry:
> switch (ret) {
> case -ESRCH:
> /*
> - * No owner found for this futex. Check if the
> - * OWNER_DIED bit is set to figure out whether
> - * this is a robust futex or not.
> + * We failed to find an owner for this
> + * futex. So we have no pi_state to block
> + * on. This can happen in two cases:
> + *
> + * 1) The owner died
> + * 2) A stale FUTEX_WAITERS bit
> + *
> + * Re-read the futex value.
> */
> if (get_futex_value_locked(&curval, uaddr))
> return -EFAULT;
>
> /*
> - * We simply start over in case of a robust
> - * futex. The code above will take the futex
> - * and return happy.
> + * If the owner died or we have a stale
> + * WAITERS bit the owner TID in the user space
> + * futex is 0.
> */
> - if (curval & FUTEX_OWNER_DIED) {
> - ownerdied = 1;
> + if (!(curval & FUTEX_TID_MASK)) {
> + force_take = 1;
> goto retry;
> }
> default:
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel