Re: [PATCH] rtmutex: ensure we wake up the top waiter

From: Wander Lairson Costa
Date: Tue Jan 31 2023 - 12:54:49 EST


On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > Wander!
> >
> > On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> > > On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > >> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> > >> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > >> > phase, waiter may be initially in the top of the queue, but after
> > >> > dequeued and requeued it may no longer be true.
> > >>
> > >> That's related to your above argumentation in which way?
> > >>
> > >
> > > I think I made the mistake of not explicitly saying at least three
> > > tasks are involved:
> > >
> > > - A Task T1 that currently holds the mutex
> > > - A Task T2 that is the top waiter
> > > - A Task T3 that changes the top waiter
> > >
> > > T3 tries to acquire the mutex, but as T1 holds it, it calls
> > > task_blocked_on_lock() and saves the owner. It eventually calls
> > > rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> > > doing so. This opens a window for T1 to release the mutex and wake up
> > > T2. Before T2 runs, T3 acquires the wait lock again inside
> > > rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> > > changes the top waiter, then 1) When T2 runs, it will verify that it
> > > is no longer the top waiter and comes back to sleep 2) As you observed
> > > below, the waiter doesn't point to the top waiter and, therefore, it
> > > will wake up the wrong task.
> >
> > This is still confusing as hell because the wait locks you are talking
> > about belong to different rtmutexes. So there is no drops wait lock and
> > reacquires wait lock window.
> >
> > There must be (multiple) lock chains involved, but I couldn't figure out
> > yet what the actaul scenario is in the case of a pure rt_spinlock clock
> > chain:
> >
>
> Sorry, it took so long to reply. I didn't have the traces anymore and
> had to regenerate them. You spotted an error in my analysis, then I
> had to start over.
>
> > > Another piece of information I forgot: I spotted the bug in the
> > > spinlock_rt, which uses a rtmutex under the hood. It has a different
> > > code path in the lock scenario, and there is no call to
> > > remove_waiter() (or I am missing something).
> >
> > Correct. But this still might be a lock chain issue where a non
> > rt_spinlock which allows early removal.
> >
> > > Anyway, you summed it up pretty well here: "@waiter is no longer the
> > > top waiter due to the requeue operation". I tried (and failed) to
> > > explain the call chain that ends up in the buggy scenario, but now I
> > > think I should just describe the fundamental problem (the waiter
> > > doesn't point to the top waiter).
> >
> > You really want to provide the information WHY this case can happen at
> > all. If it's not the removal case and related to some other obscure lock
> > chain problem, then we really need to understand the scenario because
> > that lets us analyze whether there are other holes we did not think
> > about yet.
> >
> > If you have traces which show the sequence of lock events leading to
> > this problem, then you should be able to decode the scenario. If you
> > fail to extract the information, then please provide the traces so we
> > can stare at them.
> >
>
> Here we go:
>
> Let L1 and L2 be two spinlocks.
>
> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
> waiter of L2.
>
> Let T2 be the task holding L2.
>
> Let T3 be a task trying to acquire L1.
>
> The following events will lead to a state in which the wait queue of L2
> isn't empty but nobody holds it.
>
> T1 T2 T3
> spin_lock(L1)
>
> raw_spin_lock(L1->wait_lock)
>
> rtlock_slowlock_locked(L1)
>
> task_blocks_on_rt_mutex(L1, T3)
>
> orig_waiter->lock = L1
>
> orig_waiter->task = T3
>
> raw_spin_unlock(L1->wait_lock)
>
> rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
>
> spin_unlock(L2)
> rt_mutex_slowunlock(L2)
> raw_spin_lock(L2->wait_lock)
> wakeup(T1)
> raw_spin_unlock(L2->wait_lock)
>
> waiter = T1->pi_blocked_on
>
> waiter == rt_mutex_top_waiter(L2)
>
> waiter->task == T1
>
> raw_spin_lock(L2->wait_lock)
>
> dequeue(L2, waiter)
>
> update_prio(waiter, T1)
>
> enqueue(L2, waiter)
>
> waiter != rt_mutex_top_waiter(L2)
>
> L2->owner == NULL
>
> wakeup(T1)
>
> raw_spin_unlock(L2->wait_lock)
> T1 wakes up
> T1 != top_waiter(L2)
> schedule_rtlock()
>

Arrrrghhhh, s**t mail servers... Hopefully now it is formatted correctly:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                              T2                              T3
                                                                spin_lock(L1)
                                                                  raw_spin_lock(L1->wait_lock)
                                                                  rtlock_slowlock_locked(L1)
                                                                    task_blocks_on_rt_mutex(L1, T3)
                                                                      orig_waiter->lock = L1
                                                                      orig_waiter->task = T3
                                                                      raw_spin_unlock(L1->wait_lock)
                                                                      rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                                spin_unlock(L2)
                                  rt_mutex_slowunlock(L2)
                                    raw_spin_lock(L2->wait_lock)
                                    wakeup(T1)
                                    raw_spin_unlock(L2->wait_lock)
                                                                              waiter = T1->pi_blocked_on
                                                                              waiter == rt_mutex_top_waiter(L2)
                                                                              waiter->task == T1
                                                                              raw_spin_lock(L2->wait_lock)
                                                                              dequeue(L2, waiter)
                                                                              update_prio(waiter, T1)
                                                                              enqueue(L2, waiter)
                                                                              waiter != rt_mutex_top_waiter(L2)
                                                                              L2->owner == NULL
                                                                              wakeup(T1)
                                                                              raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()