Re: handle_exit_race && PF_EXITING

From: Thomas Gleixner
Date: Wed Nov 06 2019 - 04:53:14 EST


Oleg,

On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> I have found the fix I sent in 2015, attached below. I forgot everything
> I knew about futex.c, so I need some time to adapt it to the current code.
>
> But I think it is clear what this patch tries to do, do you see any hole?

> @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
>
> if (!futex_cmpxchg_enabled)
> return;
> +
> /*
> - * We are a ZOMBIE and nobody can enqueue itself on
> - * pi_state_list anymore, but we have to be careful
> - * versus waiters unqueueing themselves:
> + * attach_to_pi_owner() can no longer add the new entry. But
> + * we have to be careful versus waiters unqueueing themselves.
> */
> + curr->flags |= PF_EXITPIDONE;

This obviously would need a barrier or would have to be moved inside of the
pi_lock region.

> raw_spin_lock_irq(&curr->pi_lock);
> while (!list_empty(head)) {
>
> @@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
> return -EPERM;
> }
>
> - /*
> - * We need to look at the task state flags to figure out,
> - * whether the task is exiting. To protect against the do_exit
> - * change of the task flags, we do this protected by
> - * p->pi_lock:
> - */
> raw_spin_lock_irq(&p->pi_lock);
> - if (unlikely(p->flags & PF_EXITING)) {
> - /*
> - * The task is on the way out. When PF_EXITPIDONE is
> - * set, we know that the task has finished the
> - * cleanup:
> - */
> - int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
> -
> + if (unlikely(p->flags & PF_EXITPIDONE)) {
> + /* exit_pi_state_list() was already called */
> raw_spin_unlock_irq(&p->pi_lock);
> put_task_struct(p);
> - return ret;
> + return -ESRCH;

But, this is incorrect because we'd return -ESRCH to user space while the
futex value still has the TID of the exiting task set which will
subsequently cleanout the futex and set the owner died bit.

The result is inconsistent state and will trigger the asserts in the futex
test suite and in the pthread_mutex implementation.

The only reason why -ESRCH can be returned is when the user space value of
the futex contains garbage. But in this case it does not contain garbage
and returning -ESRCH violates the implicit robustness guarantee of PI
futexes and causes unexpected havoc.

See da791a667536 ("futex: Cure exit race") for example.

The futex PI contract between kernel and user space relies on consistent
state. Guess why that code has more corner case handling than actual
functionality. :)

Thanks,

tglx