[PATCH RT 0/6] New read/write locks for PI and multiple readers

From: Steven Rostedt
Date: Fri Apr 25 2008 - 10:21:30 EST


Currently in the -rt patch the rwlocks (and rwsems) are converted into simple
mutexes. That is, when PREEMPT_RT is configured, the rwsems and rwlocks
can only have one reader at a time; each lock may only be held by a single
owner (reader or writer). The reason for this is to allow priority
inheritance (PI) for these locks. PI can be quite complex, but it is easier
to handle when dealing with only a single owner; when multiple owners
are involved it gets even more complex. This patch series aims to
accomplish PI over multiple owners.

Having multiple owners may also have other drawbacks in RT. One is that
it adds a bit more non-determinism: a writer may have to wait for the
critical section time multiplied by the number of readers that currently
hold the lock. Since the number of readers is not bounded, this creates a
non-deterministic time frame. To deal with this, this patch series adds
the ability to limit the number of readers that may own the lock. Setting
that number to the number of CPUs may be a good idea.
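
As a rough illustration of that limit (the function name and the "0 means
unlimited" convention are made up for this mail, not taken from the
patches), a new reader would only be admitted while the current owner
count is below the configured limit:

static int rw_mutex_may_add_reader(int nr_owners, int rw_limit)
{
	/* rw_limit == 0 is taken here to mean "no limit on readers" */
	return rw_limit == 0 || nr_owners < rw_limit;
}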

Having no limit to the number of readers may also be preferable. Although
it adds some non-determinism, the PI on the readers prevents unbounded
priority inversion, where a medium priority task can keep a lower priority
task from running while that lower priority task holds a lock that higher
priority tasks are waiting for.

The way this algorithm works is by adding a new descriptor called
rw_mutex. This descriptor contains an RT mutex (one with PI) as well as
a new owner field. When taking the lock for read, the task does a cmpxchg
(if the arch supports it) on the owner field, placing its own current
pointer there if the field is currently NULL. If this succeeds, it has
grabbed the lock and increments the owner counts. If it fails, it enters
the slow path (if the arch doesn't have cmpxchg then it always enters
the slow path).
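
For illustration only, here is a minimal userspace model of that read
fast path; the names (struct rw_mutex, rw_mutex_read_trylock_fast) and
the layout are made up for this mail and are not the actual patch code:

#include <stdatomic.h>
#include <stdbool.h>

struct rw_mutex {
	_Atomic unsigned long owner;	/* task pointer plus low flag bits */
	/* the real descriptor also holds an rt_mutex, reader counts, etc. */
};

/*
 * Reader fast path: if nobody owns the lock (owner is NULL), install our
 * own task pointer with a single cmpxchg and take the lock; otherwise
 * fall back to the slow path.
 */
static bool rw_mutex_read_trylock_fast(struct rw_mutex *rwm,
				       unsigned long curr_task)
{
	unsigned long expected = 0;	/* NULL owner */

	if (atomic_compare_exchange_strong(&rwm->owner, &expected, curr_task))
		return true;	/* got it; the real code now bumps owner counts */

	return false;		/* contended: take the slow path */
}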

The slow path for read checks whether the owner field is already set for
readers; if it is, it bumps the owner counts of the lock, sets up the
pointers from the lock to the readers, and grabs the lock.

If that fails, the reader blocks much as it would on a simple mutex,
since it may need to boost the writer (of which there can be only one).
But first, if the owner of the rw_mutex is not set in the rt_mutex, the
rt_mutex owner is updated to allow the old PI algorithm to work (see the
patches for details).

For a writer to take the lock, the fast path does a cmpxchg on the owner
field of the rw_mutex, trying to copy in the current pointer with the
second least significant bit set (the first least significant bit is used
to serialize fast paths with slow paths). The second least significant
bit in the owner field denotes that the owner is a writer.
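
Continuing the illustrative model above (same made-up struct rw_mutex),
the writer fast path and the flag bits could be sketched like this; the
macro names are mine, not the patch's:

/* flag bits kept in the low two bits of the owner word */
#define RW_MUTEX_PENDING	0x1UL	/* serializes fast path vs. slow path */
#define RW_MUTEX_WRITER		0x2UL	/* the owner holds the lock for write */

static bool rw_mutex_write_trylock_fast(struct rw_mutex *rwm,
					unsigned long curr_task)
{
	unsigned long expected = 0;	/* lock must be completely free */

	return atomic_compare_exchange_strong(&rwm->owner, &expected,
					      curr_task | RW_MUTEX_WRITER);
}

/* mask off the flag bits to recover the owning task pointer */
static unsigned long rw_mutex_owner(struct rw_mutex *rwm)
{
	return atomic_load(&rwm->owner) & ~(RW_MUTEX_PENDING | RW_MUTEX_WRITER);
}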

If the fast path fails, the writer enters the slow path. There, if it
still can't get the lock, it updates the rt_mutex owner field (don't
confuse the rt_mutex with the rw_mutex descriptors). If the owner of the
lock is a reader, the rt_mutex is given a fictitious owner (RT_RW_READER).
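
The actual definition of RT_RW_READER lives in the patches; the idea,
illustrated here with a made-up value, is simply a reserved pointer that
can never be a real task, so the chain walk can recognize "owned by
readers":

struct task_struct;

/* illustrative only; see the patches for the real definition */
#define RT_RW_READER	((struct task_struct *)0x100)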

The current PI algorithm takes at most two of the internal spin_locks
at a time and checks for deadlocks. One reason to do this is that the
algorithm is also used with futexes, and we can't allow user space code
to deadlock the kernel. But because rwlocks and rwsems are never held
while taking a futex (that would just be bad) we can trust that the
rwlocks and rwsems will not deadlock (or the kernel itself is bad).

Because of this, when the PI algorithm walks the chain and comes across
an owner of RT_RW_READER, it breaks out of the loop and will continue
to hold that rt_mutex's internal spin lock. Then it will go through
each of the readers and continue the algorithm on each.
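
A heavily simplified, self-contained model of that walk (all names and
the priority convention are made up for this mail; the real code walks
the rt_mutex waiter chains and takes the proper locks) might look like:

#define MAX_RWLOCK_DEPTH	5	/* recursion/ownership limit (see below) */
#define MAX_READERS		4	/* arbitrary bound for this model only   */

struct task;

struct rw_lock_model {
	int nr_owners;				/* readers currently holding it  */
	struct task *owner[MAX_READERS];	/* which tasks those readers are */
};

struct task {
	int prio;				/* model only: larger = higher prio */
	struct rw_lock_model *blocked_on;	/* reader-held lock we wait on, or NULL */
};

/*
 * Boost every current reader of 'rwm' to at least 'prio'.  If a reader
 * is itself blocked on another reader-held lock, recurse into that lock,
 * bounded by MAX_RWLOCK_DEPTH.
 */
static void boost_readers(struct rw_lock_model *rwm, int prio, int depth)
{
	int i;

	if (depth >= MAX_RWLOCK_DEPTH)
		return;

	for (i = 0; i < rwm->nr_owners; i++) {
		struct task *reader = rwm->owner[i];

		if (reader->prio < prio)
			reader->prio = prio;
		if (reader->blocked_on)
			boost_readers(reader->blocked_on, prio, depth + 1);
	}
}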

Yes, this contains a recursive function, but it also keeps track of the
number of times it recurses, which brings us to the limits that the
algorithm imposes. A task may hold at most 5 different reader locks
at the same time. This should not be an issue, since it would be bad
programming practice to hold more than 5 different reader locks anyway.
The limit exists because each task has an array of 5 reader lock
structures that it uses to keep track of the reader locks it owns, for
use by the PI algorithm.
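
The shape of that per-task bookkeeping, again with illustrative names
only, is roughly:

#define MAX_RWLOCK_DEPTH	5

struct rw_mutex;	/* the lock descriptor introduced by this series */

struct reader_lock_struct {
	struct rw_mutex *lock;	/* a reader lock this task currently holds */
	/* the real structure also carries list pointers back to the lock */
};

/* per-task state used by the PI walk over readers */
struct task_reader_locks {
	int reader_lock_count;					/* slots in use */
	struct reader_lock_struct owned_read_locks[MAX_RWLOCK_DEPTH];
};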

For more on this code, see the patches.

-- Steve

