Re: [TIP][RFC 6/7] futex: add requeue_pi calls

From: Darren Hart
Date: Thu Mar 05 2009 - 11:51:33 EST


Darren Hart wrote:
Darren Hart wrote:
From: Darren Hart <dvhltc@xxxxxxxxxx>

PI futexes must have an owner at all times, so the standard requeue commands
aren't sufficient. The new commands properly manage PI futex ownership by
ensuring a futex with waiters has an owner at all times. Once complete, these
patches will allow glibc to properly handle PI mutexes with pthread condvars.

The approach taken here is to create two new futex op codes:

FUTEX_WAIT_REQUEUE_PI:
Threads will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will take this pi futex (and the underlying rt_mutex).

futex_wait_requeue_pi() is currently the result of a high speed collision
between futex_wait and futex_lock_pi (with the first part of futex_lock_pi
being done by futex_requeue_pi_init() on behalf of the waiter).
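
For the userspace side, the waiter's call would look something like the
following. This is only a sketch of how I expect glibc to use it: it assumes
the usual six-argument futex syscall layout with the target PI futex passed
via uaddr2 and the new op code visible in <linux/futex.h>, none of which is
settled ABI yet.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical waiter-side wrapper (pthread_cond_wait style); the
 * argument layout mirrors FUTEX_WAIT with the PI mutex in uaddr2 and
 * is my assumption, not a finalized interface. */
static int futex_wait_requeue_pi(int *cond_futex, int val, int *pi_mutex)
{
        /* Sleep on cond_futex (provided *cond_futex == val), get
         * requeued to pi_mutex by the waker, and return only once the
         * kernel has taken the PI futex (and the underlying rt_mutex)
         * on our behalf. */
        return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
                       val, NULL /* no timeout */, pi_mutex, 0);
}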

FUTEX_REQUEUE_PI:
This call must be used to wake threads waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many threads the caller intends to wake or requeue.
pthread_cond_broadcast should call this with nr_wake=1 and nr_requeue=-1 (all).
pthread_cond_signal should call this with nr_wake=1 and nr_requeue=0. The
reason is that both callers need the benefit of the futex_requeue_pi_init()
routine, which prepares the top_waiter (the thread to be woken) to take
possession of the PI futex by setting FUTEX_WAITERS and preparing the
futex_q.pi_state. futex_requeue() also enqueues the top_waiter on the
rt_mutex via rt_mutex_start_proxy_lock(). If pthread_cond_signal used
FUTEX_WAKE, we would have a similar race window: the caller could return and
release the mutex before the waiters have fully woken, potentially leaving the
rt_mutex with waiters but no owner.
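
Concretely, both glibc entry points would funnel through the same wrapper,
something like the sketch below (using the same includes as the waiter sketch
above). Passing nr_requeue in the timeout slot mirrors FUTEX_CMP_REQUEUE and
is my assumption rather than settled ABI.

/* Hypothetical waker-side wrapper; nr_requeue rides in the timeout
 * slot as an integer, as FUTEX_CMP_REQUEUE does today (assumption). */
static int futex_requeue_pi(int *cond_futex, int *pi_mutex,
                            int nr_wake, int nr_requeue)
{
        return syscall(SYS_futex, cond_futex, FUTEX_REQUEUE_PI, nr_wake,
                       (void *)(long)nr_requeue, pi_mutex, 0);
}

/* pthread_cond_signal():    futex_requeue_pi(&cond, &mutex, 1, 0);        */
/* pthread_cond_broadcast(): futex_requeue_pi(&cond, &mutex, 1, -1);  -1 == all */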

We hit a failed paging request running the testcase (7/7) in a loop
(only takes a few minutes at most to hit on my 8way x86_64 test
machine). It appears to be the result of splitting rt_mutex_slowlock()
across two execution contexts by means of rt_mutex_start_proxy_lock()
and rt_mutex_finish_proxy_lock(). The former is called by the requeueing
thread and runs task_blocks_on_rt_mutex() on behalf of the waiting task
before requeueing and waking it. The latter is executed upon wakeup by
the waiting thread, which somehow manages to call the new
__rt_mutex_slowlock() with waiter->task != NULL and still succeed with
try_to_take_rt_mutex(); this leads to corruption of the plists and an
eventual failed paging request. See 7/7 for the rather crude testcase
that causes this. Any tips on where this race might be occurring are
welcome.
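
For reference, the split I'm describing is roughly the following (a schematic
of the intended flow only, with arguments and error handling elided; don't
read it as the actual patch):

/*
 * Requeueing (signaling) thread, in futex_requeue():
 *   futex_requeue_pi_init()       - set FUTEX_WAITERS, prepare
 *                                   futex_q.pi_state for the top_waiter
 *   rt_mutex_start_proxy_lock()   - task_blocks_on_rt_mutex() on behalf
 *                                   of the still-sleeping waiter
 *   requeue and wake the top_waiter
 *
 * Waiting thread, in futex_wait_requeue_pi(), after wakeup:
 *   rt_mutex_finish_proxy_lock()  - run the rest of the slowlock loop
 *                                   (__rt_mutex_slowlock()) and remove
 *                                   the waiter
 */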

After some judicious use of printk (ftrace from tip wouldn't let me set the
current_tracer, permission denied), I managed to catch a failing scenario.
(Thanks to Steven for helping me get a working ftrace in tip since then; that
trace appears further below.) The signaling thread returns to userspace and
unlocks the mutex before the waiting thread calls __rt_mutex_slowlock(), which
is fine in itself, but the signaler calls rt_mutex_fastunlock() instead of
rt_mutex_slowunlock(). That is exactly what rt_mutex_start_proxy_lock() was
supposed to prevent, so I am apparently not fully preparing the waiter and
enqueueing it on the rt_mutex. Annotated printk output:

Signaler thread in futex_requeue()
lookup_pi_state: allocating a new pi state
futex_requeue_pi_init: futex_lock_pi_atomic returned: 0
futex_requeue: futex_requeue_pi_init returned: 0

Signaler thread returned to userspace and did pthread_mutex_unlock()
rt_mutex_fastunlock: unlocked ffff88013d1749d0

Waiting thread woke up in futex_wait_requeue_pi() and tries to finish taking the lock:
__rt_mutex_slowlock: waiter->task is ffff8802bdd350c0
try_to_take_rt_mutex: assigned rt_mutex (ffff88013d1749d0) owner
to current ffff8802bdd350c0

Waiting thread gets the lock while waiter->task is not NULL (b/c the signaler didn't go through the slow path)
__rt_mutex_slowlock: got the lock

I'll continue looking into this tomorrow, but Steven, if you have any ideas on what I may have missed in rt_mutex_start_proxy_lock(), I'd appreciate any insight you might have to share. Thomas, I know you gave this function some thought as well; did I take a radically different approach to what you had in mind?

I've updated my tracing and can show that rt_mutex_start_proxy_lock() is not setting RT_MUTEX_HAS_WAITERS like it should be:

------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:646!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
---------------------------------
<...>-3793 1d..3 558351872us : lookup_pi_state: allocating a new pi state
<...>-3793 1d..3 558351876us : lookup_pi_state: initial rt_mutex owner: ffff88023d9486c0
<...>-3793 1...2 558351877us : futex_requeue: futex_lock_pi_atomic returned: 0
<...>-3793 1...2 558351877us : futex_requeue: futex_requeue_pi_init returned: 0
<...>-3793 1...3 558351879us : rt_mutex_start_proxy_lock: task_blocks_on_rt_mutex returned 0
<...>-3793 1...3 558351880us : rt_mutex_start_proxy_lock: lock has waiterflag: 0
<...>-3793 1...1 558351888us : rt_mutex_unlock: unlocking ffff88023b5f6950
<...>-3793 1...1 558351888us : rt_mutex_unlock: lock waiter flag: 0
<...>-3793 1...1 558351889us : rt_mutex_unlock: unlocked ffff88023b5f6950
<...>-3783 0...1 558351893us : __rt_mutex_slowlock: waiter->task is ffff88023c872440
<...>-3783 0...1 558351897us : try_to_take_rt_mutex: assigned rt_mutex (ffff88023b5f6950) owner to current ffff88023c872440
<...>-3783 0...1 558351897us : __rt_mutex_slowlock: got the lock
---------------------------------

I'll start digging into why that's happening, but I wanted to share the trace output.
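
To spell out why the missing bit matters: the rt_mutex fast paths key off the
low bit of lock->owner, so if RT_MUTEX_HAS_WAITERS is never set, the
signaler's unlock is just a cmpxchg and never looks at the wait list. Roughly
(a simplified paraphrase of kernel/rtmutex.c, not verbatim):

#define RT_MUTEX_HAS_WAITERS    1UL     /* bit 0 of lock->owner */

/* cmpxchg on the owner word; fails whenever the waiter bit is set */
#define rt_mutex_cmpxchg(l, c, n)       (cmpxchg(&(l)->owner, c, n) == c)

static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
                    void (*slowfn)(struct rt_mutex *lock))
{
        /* With RT_MUTEX_HAS_WAITERS clear, owner == current, the
         * cmpxchg succeeds, and rt_mutex_slowunlock() (which would hand
         * the lock to the enqueued waiter) is never called. */
        if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
                rt_mutex_deadlock_account_unlock(current);
        else
                slowfn(lock);
}

That matches the trace above: the waiter flag is 0 at unlock time, so the
signaler takes the fast path and the proxy-locked waiter later takes the lock
itself with waiter->task still set.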

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team