Wait for mutex to become unlocked

From: Matthew Wilcox
Date: Wed May 04 2022 - 17:44:39 EST


Paul, Liam and I were talking about some code we intend to write soon
and realised there's a missing function in the mutex & rwsem API.
We're intending to use it for an rwsem, but I think it applies equally
to mutexes.

The customer has a low priority task which wants to read /proc/pid/smaps
of a higher priority task. Today, everything is awful; smaps acquires
mmap_sem read-only, is preempted, then the high-pri task calls mmap()
and the down_write(mmap_sem) blocks on the low-pri task. Then all the
other threads in the high-pri task block on the mmap_sem as they take
page faults, because new readers queue behind a waiting writer so that
writers don't starve.

The approach we're looking at is to allow RCU lookup of VMAs, and then
take a per-VMA rwsem for read. Because we're under RCU protection,
that looks a bit like this:

	rcu_read_lock();
	vma = vma_lookup();
	if (down_read_trylock(&vma->sem)) {
		rcu_read_unlock();
	} else {
		rcu_read_unlock();
		down_read(&mm->mmap_sem);
		vma = vma_lookup();
		down_read(&vma->sem);
		up_read(&mm->mmap_sem);
	}

(for clarity, I've skipped the !vma checks; don't take this too literally)

So this is Good. For the vast majority of cases, we avoid taking the
mmap read lock and the problem will appear much less often. But we can
do Better with a new API. You see, for this case, we don't actually
want to acquire the mmap_sem; we're happy to spin a bit, but there's no
point in spinning waiting for the writer to finish when we can sleep.
I'd like to write this code:

again:
	rcu_read_lock();
	vma = vma_lookup();
	if (down_read_trylock(&vma->sem)) {
		rcu_read_unlock();
	} else {
		rcu_read_unlock();
		rwsem_wait_read(&mm->mmap_sem);
		goto again;
	}

That is, rwsem_wait_read() puts the thread on the rwsem's wait queue,
and wakes it up without giving it the lock. Since this thread never
holds mmap_sem, it can never block a thread that tries to acquire
mmap_sem for write.
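
To make the intended semantics concrete, here is a rough userspace
model, not kernel code. Everything prefixed toy_ is a hypothetical name
made up for illustration, and the model is built on a pthread mutex and
condition variable rather than the real rwsem internals. It also
ignores writer fairness (a queued writer does not hold off new readers
here). The only point is that toy_rwsem_wait_read() sleeps while a
writer holds the lock and returns without becoming a reader:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy reader/writer semaphore; hypothetical, for illustration only. */
struct toy_rwsem {
	pthread_mutex_t guard;		/* protects readers and writer */
	pthread_cond_t changed;		/* broadcast on every release */
	int readers;			/* number of current read holders */
	bool writer;			/* true while held for write */
};

#define TOY_RWSEM_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Take the lock for read if no writer holds it; never sleeps on a writer. */
bool toy_down_read_trylock(struct toy_rwsem *s)
{
	bool ok;

	pthread_mutex_lock(&s->guard);
	ok = !s->writer;
	if (ok)
		s->readers++;
	pthread_mutex_unlock(&s->guard);
	return ok;
}

void toy_up_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->guard);
	assert(s->readers > 0);
	s->readers--;
	pthread_cond_broadcast(&s->changed);
	pthread_mutex_unlock(&s->guard);
}

/* Sleep until there are no readers and no writer, then take it for write. */
void toy_down_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->guard);
	while (s->writer || s->readers > 0)
		pthread_cond_wait(&s->changed, &s->guard);
	s->writer = true;
	pthread_mutex_unlock(&s->guard);
}

void toy_up_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->guard);
	assert(s->writer);
	s->writer = false;
	pthread_cond_broadcast(&s->changed);
	pthread_mutex_unlock(&s->guard);
}

/*
 * The proposed primitive, modelled: sleep while a writer holds the lock,
 * then return WITHOUT incrementing readers. The caller owns nothing
 * afterwards, so it cannot delay a later toy_down_write().
 */
void toy_rwsem_wait_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->guard);
	while (s->writer)
		pthread_cond_wait(&s->changed, &s->guard);
	pthread_mutex_unlock(&s->guard);
}
```

In this model the retry loop above becomes "trylock; on failure
toy_rwsem_wait_read() and go around again". A waiter that gets
preempted right after waking holds no lock, which is the property we
are after.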

Similarly, it may make sense to add rwsem_wait_write() and mutex_wait().
Perhaps also mutex_wait_killable() and mutex_wait_interruptible()
(the combinatoric explosion is a bit messy; I don't know that it makes
sense to do the _nested, _io variants).

Does any of this make sense?