Re: Wait for mutex to become unlocked

From: Waiman Long
Date: Wed May 04 2022 - 21:18:12 EST


On 5/4/22 17:44, Matthew Wilcox wrote:
Paul, Liam and I were talking about some code we intend to write soon
and realised there's a missing function in the mutex & rwsem API.
We're intending to use it for an rwsem, but I think it applies equally
to mutexes.

The customer has a low priority task which wants to read /proc/pid/smaps
of a higher priority task. Today, everything is awful; smaps acquires
mmap_sem read-only, is preempted, then the high-pri task calls mmap()
and the down_write(mmap_sem) blocks on the low-pri task. Then all the
other threads in the high-pri task block on the mmap_sem as they take
page faults because we don't want writers to starve.

The approach we're looking at is to allow RCU lookup of VMAs, and then
take a per-VMA rwsem for read. Because we're under RCU protection,
that looks a bit like this:

rcu_read_lock();
vma = vma_lookup();
if (down_read_trylock(&vma->sem)) {
rcu_read_unlock();
} else {
rcu_read_unlock();
down_read(&mm->mmap_sem);
vma = vma_lookup();
down_read(&vma->sem);
up_read(&mm->mmap_sem);
}

(for clarity, I've skipped the !vma checks; don't take this too literally)

So this is Good. For the vast majority of cases, we avoid taking the
mmap read lock and the problem will appear much less often. But we can
do Better with a new API. You see, for this case, we don't actually
want to acquire the mmap_sem; we're happy to spin a bit, but there's no
point in spinning waiting for the writer to finish when we can sleep.
I'd like to write this code:

again:
rcu_read_lock();
vma = vma_lookup();
if (down_read_trylock(&vma->sem)) {
rcu_read_unlock();
} else {
rcu_read_unlock();
rwsem_wait_read(&mm->mmap_sem);
goto again;
}

That is, rwsem_wait_read() puts the thread on the rwsem's wait queue,
and wakes it up without giving it the lock. Now this thread will never
be able to block any thread that tries to acquire mmap_sem for write.

I suppose that a writer that needs to take a write lock on vma->sem will have to take a write lock on mmap_sem first, then it makes sense to me that you want to wait for all the vma->sem writers to finish by waiting on the wait queue of mmap_sem. By the time the waiting task is being woken up, there is no active write lock on the vma->sem and hopefully by the time the waiting process wakes up and do a down_read_trylock(), it will succeed. However, the time gap in the wakeup process may have another writer coming in taking the vma->sem write lock. It improves the chance of a successful trylock but it is not guaranteed. So you will need a retry count and revert back to a direct down_read() when there are too many retries.

Since the waiting process isn't taking any lock, the name rwsem_wait_read() may be somewhat misleading. I think a better name may be rwsem_flush_waiters(). So do you want to flush the waiters at the point this API is called or you want to wait until the wait queue is empty?

Cheers,
Longman