Re: [RFC] Lock free fd lookup

From: Keith Owens
Date: Fri Jul 16 2004 - 19:57:22 EST


On Thu, 15 Jul 2004 23:27:32 -0700,
William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
>The gist of all this is that busywait-free atomic updates are only
>implementable for data structures that don't link through the object
>but rather maintain an external index with a single pointer to elements
>needing updates, like radix trees, B+ trees, arrays of pointers, and
>open-addressed hashtables.

There are algorithms for busywait-free (lock free) traversal and
concurrent update of lists and structures that contain embedded
pointers. I once had to implement one on an O/S where the user space
to kernel wait/alarm semantics could not satisfy the timing
requirements of the calling code (don't ask).

The beauty of these lockfree algorithms on large structures is that
nothing ever stalls indefinitely. If the underlying SMP cache hardware
is fair, everything running a lockfree list will make progress. These
algorithms do not suffer from reader vs. writer starvation.

The downside is that the algorithms rely on an extra sequence field per
word. They copy out the structure, modify the local copy, write back
with an update of sequence field. Writing back detects if any other
update has invalidated the current structure, at which point the second
update is discarded and the writer redrives its update. Readers are
guaranteed to see a consistent view of the copy, even if the master
structure is being updated at the same time as it is being read.

Bottom line, it can be done but ...

* The structure size doubles with the addition of the sequence fields.
* The hardware must support cmpxchg on the combination of the sequence
field and the data word that it protects. LL/SC can be used instead
of cmpxchg if the hardware supports LL/SC.
* If the hardware only supports cmpxchg4 then the sequence field is
restricted to 2 bytes, which increases the risk of wraparound. If an
update is delayed for exactly the wraparound interval then the data
may be corrupted, that is a standard risk of small sequence fields.
* The extra copy out just to read a structure needs more memory
bandwidth, plus local allocation for the copy.
* The internal code of the API to traverse a list containing lockfree
structures is pretty messy, although that is hidden from the caller.
* All writers must preserve enough state to redrive their update from
scratch if a concurrent update has changed the same structure. Note,
only the curent structure, not the rest of the list.
* It requires type stable storage. That is, once a data area has been
allocated to a particular structure type, it always contains that
structure type, even when it has been freed from the list. Each list
requires its own free pool, which can never be returned to the OS.

Nice research topics, but not suitable for day to day work. I only
used the lockfree algorithms because I could not find an alternative on
that particular OS. I am not sure that the Linux kernel has any
problems that require the additional complexity.

For some light reading, the code I implemented was based on Practical
Implementations of Non-Blocking Synchronization Primitives by Mark Moir
http://research.sun.com/scalable/Papers/moir-podc97.ps. Note that page
6, line 6 has an error. Replace "y := z" with "y := addr->data[i];".

I implemented a completely lockfree 2-3-4 tree with embedded pointers
using Moir's algorithm. The implementation allowed multiple concurrent
readers and writers. It even allowed concurrent list insert, delete
and rebalancing, while the structures on the were being read and
updated.

2-3-4 trees are self balancing, which gave decent lookup performance.
Since red-black trees are logically equivalent to 2-3-4 trees it should
be possible to use lockfree red-black trees. However I could not come
up with a lockfree red-black tree, mainly because an read-black insert
requires atomic changes to multiple structures. The 2-3-4 tree only
needs atomic update to one structure at a time.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/