Re: [PATCH 0/3] 64-bit futexes: Intro

From: Nick Piggin
Date: Thu Jun 05 2008 - 21:28:34 EST


On Mon, Jun 02, 2008 at 01:22:57PM -0700, Linus Torvalds wrote:
>
> [ I added Nick and DavidH to the Cc, since they have at least historically
> shown interest in locking algorithms ]
>
> On Mon, 2 Jun 2008, Ingo Molnar wrote:
> >
> > i suspect _any_ abstract locking functionality around a data structure
> > can be implemented via atomic control over just a single user-space bit.
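
[ Aside: the claim is true in the narrow sense. A minimal sketch of a
  one-bit futex mutex -- hypothetical helper names, GCC __sync builtins,
  nothing to do with the patch series -- could look like the below. The
  practical catch is already visible: one bit cannot record whether
  anyone is waiting, so unlock must issue a FUTEX_WAKE syscall every
  time, which is exactly where theory and practice part ways. ]

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex(int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void bitlock_lock(int *lock)
{
        /* 0 = free, 1 = held; sleep in the kernel while the bit is set */
        while (__sync_lock_test_and_set(lock, 1))
                futex(lock, FUTEX_WAIT, 1);
}

static void bitlock_unlock(int *lock)
{
        __sync_lock_release(lock);      /* clear the bit */
        futex(lock, FUTEX_WAKE, 1);     /* can't tell whether anyone waits */
}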
>
> Well, theory is theory, and practice is different.
>
> That's *especially* true when it comes to locking, which is just so
> tightly coupled to the exact details of which atomics are fast on a
> particular architecture.
>
> Also, different people will want to see different performance profiles.
>
> For example, it makes a *huge* difference whether you are strictly fair or
> not. I can almost guarantee that a 100% fair implementation will be really
> "nice" from a theoretical standpoint, but suck really badly in practice,
> because if you want best performance, then you want the lock to have a
> very strong CPU affinity - and the faster you can do your lock ops, the
> more unfairness and CPU affinity they get!
>
> And rwlocks in particular are actually much more interesting in this
> respect, because they not only have that CPU affinity vs fairness tension,
> but also reader-vs-writer fairness. You optimize for one load, and it may give
> you almost perfect performance, but at the cost of sucking at another
> load.
>
> For example, some loads are almost entirely read-read locks, with only
> very occasional write locks for some uncommon config change thing. Do you
> want to optimize for that? Maybe. And yes, you can make those uncontended
> read-read locks go really quickly, but then (especially if you continue to
> let reads through even when writers want to contend), that can slow down
> writers a *lot*, to the point of starvation.
>
> Different CPU's will also show different patterns.
>
> Anyway, I was busy most of the weekend, but I've now coded up a partial
> but actual example implementation. It's probably buggy. Uli is very correct
> in saying that it's easy to screw up futexes, but even in the absence of
> futexes, it's just easy to screw up any threaded logic.
>
> But if anybody wants to look at a rough draft that at least limps along
> _partially_, there's
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary
>
> for you.
>
> And I'll freely admit that
> (a) the above thing is pretty hacked up. No guarantees as to correctness
> or anything else.
> (b) I looked _mainly_ at the "all readers" case, and considered that the
> primary goal. Which explains why it does really well on that
> particular load.
> (c) However, I refuse to starve writers. In fact, my thing will always
> prefer a writer to any new readers, on the assumption that the sanest
> rwlock situation is the "tons of readers, occasional writer".
> (d) it does ok for me on the write-write contention case too, but the
> random mixed "all threads do 5% writelocks" test-case seems to suck.
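
[ Regarding (c): the usual shape for that writer preference is one lock
  word with the reader count in the high bits and writer state in the
  low bits. The sketch below is illustrative only -- made-up constants
  and helpers, reusing the futex() wrapper from the earlier aside, and
  a single futex word for everybody where a real implementation would
  keep separate wait queues -- it is not the code from the git tree: ]

#include <limits.h>

#define RW_WANT         1       /* bit 0: a writer is waiting */
#define RW_HELD         2       /* bit 1: a writer holds the lock */
#define RW_READER       4       /* readers count in units of 4 */

static void read_lock(int *lock)
{
        for (;;) {
                int old = *lock;
                if (old & (RW_WANT | RW_HELD)) {
                        /* (c): back off as soon as any writer shows up */
                        futex(lock, FUTEX_WAIT, old);
                        continue;
                }
                if (__sync_bool_compare_and_swap(lock, old, old + RW_READER))
                        return;
                /* lost a race against another reader; just retry */
        }
}

static void write_lock(int *lock)
{
        int old;

        for (;;) {
                old = *lock;
                if (old & (RW_WANT | RW_HELD)) {
                        futex(lock, FUTEX_WAIT, old);   /* writers queue here */
                        continue;
                }
                if (__sync_bool_compare_and_swap(lock, old, old | RW_WANT))
                        break;          /* intent published: no new readers */
        }
        while ((old = *lock) != RW_WANT)        /* wait for readers to drain */
                futex(lock, FUTEX_WAIT, old);
        /* both bits keep everyone else out, so this cannot fail */
        __sync_bool_compare_and_swap(lock, RW_WANT, RW_HELD);
}

static void read_unlock(int *lock)
{
        if (__sync_sub_and_fetch(lock, RW_READER) == RW_WANT)
                futex(lock, FUTEX_WAKE, INT_MAX);       /* last one out wakes the writer */
}

static void write_unlock(int *lock)
{
        __sync_and_and_fetch(lock, ~RW_HELD);
        futex(lock, FUTEX_WAKE, INT_MAX);       /* readers and writers recheck */
}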
>
> As an example: the current glibc code I have seems to be uniformly and
> boringly middle-of-the-road. It really doesn't seem to be horrible at all.
> That said, the above thing _is_ 2-3x faster for me on that read-read case
> (from single thread up to 20 threads - but just tested on one particular
> machine), so if that is what you want to aim for, it's certainly easy to
> beat.
>
> But for the write case, while I can easily beat it for a low-thread count
> with little contention, my example thing above benchmarks about equal for
> bigger thread counts (caveat: again - I've only tested on one machine,
> which is a single-socket dual-core thing. Caveat emptor).
>
> And the pthreads implementation actually beats my hacked-up one at least
> for the 5% random writelocks case. It's not beating it by a huge amount,
> but it's not in the noise level either (maybe 15%). And I bet the pthreads
> implementation is a hell of a lot better tested ;)

Had a bit of a look through this; it seems pretty OK to me. Obviously with
rwlocks it *is* impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.
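
[ The contrast with a mutex, in the terms of the sketches above: a mutex
  owner is alone on the word, so its unlock can in principle be a plain
  store, while a reader's unlock shares the count with other readers and
  has to stay atomic no matter what: ]

        *lock = 0;                              /* mutex unlock: a store can do */
        __sync_fetch_and_sub(lock, RW_READER);  /* rwlock read unlock: must be atomic */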

What you *could* maybe do, to slightly speed up the reader fastpath at
the expense of the writer fastpath, is to have the active writer add 4
to the count too, so your unlock can start with a "lock xadd $-4, count"
in order to get write-intent on the cacheline straight away.

Though that's a pretty tiny "optimisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...
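
[ Concretely, with the illustrative constants from the sketch above: if
  write_lock's final claim went to RW_HELD + RW_READER instead of just
  RW_HELD, then a combined unlock could open with the xadd, so its first
  access to the lock word is already a write: ]

static void rw_unlock(int *lock)
{
        /* "lock xadd $-4": the cacheline arrives in exclusive state */
        int old = __sync_fetch_and_sub(lock, RW_READER);

        if (old & RW_HELD) {
                /* write unlock: drop the held bit, wake everybody */
                __sync_and_and_fetch(lock, ~RW_HELD);
                futex(lock, FUTEX_WAKE, INT_MAX);
        } else if (old == RW_READER + RW_WANT) {
                /* last reader out with a writer waiting */
                futex(lock, FUTEX_WAKE, INT_MAX);
        }
}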

I'd be more interested to know why this code couldn't be evolved into a full
rwlock implementation. This is a rather standard (though neat) looking rwlock
-- so my question is: what can the patented 64-bit futex locks do that this
can't, or what can they do faster?