Re: Why Semaphore Hardware-Dependent?

From: David Howells
Date: Thu Sep 14 2006 - 07:42:13 EST


Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

> > NAK for FRV. Do not use atomic_cmpxchg() there as it isn't strictly
> > atomic (FRV only has one strictly atomic operation: SWAP). Please leave
> > FRV as using the spinlock version which is more efficient on UP.
>
> From what I can read of the asm and the documentation, it is atomic. If
> it were not atomic then it is badly broken.

It is not strictly atomic because there is no way to do it strictly atomically
with the instruction set available on the FRV CPUs. What I have is okay for
UP, though not the most efficient, but will not work on SMP.

On SMP, atomic_cmpxchg() will have to be implemented with spinlocks unless
Fujitsu add something more powerful than SWAP to the available instructions.

In which case, the spinlock-based rwsems are instantly better because you only
need to take the spinlocks once, not several times, in the slow-path case.

And on UP, the spinlocks devolve to anti-preemption controls and interrupt
disablement[*], all of which is very quick and can be packed.

[*] Using JIT disablement; *actually* disabling the interrupts is very slow.

> BTW. if atomic_cmpxchg is slower than a local_irq_disable+local_irq_enable
> on your architecture then you have probably not implemented cmpxchg well
> because you may as well just implement it with local_irq_disable.

Using local_irq_disable() is not sufficient to implement cmpxchg() on anything
but a UP system, but on a UP system, you don't actually need to do cmpxchg() at
all.

If I have JIT disablement, and I don't have preemption enabled, then the
spinlock wrapping on the fastpath is this:

andcc gr0,gr0,gr0,icc2 // 1 cycle
...
oricc gr0,#1,gr0,icc2 // 1 cycle
tihi icc2,gr0,#2 // 1 cycle if no pending interrupt

which, as long as there weren't any interrupts, is about as fast as I can get
it.

It occurs to me that I could possibly improve the performance of the
spinlock-based rwsems by making use of bit 30 of the count to indicate stuff
waiting on the queue.

> I'm not so interested in counting cycles so much as consolidating duplicated
> code

There's a reason that the common parts of the code are in lib/.

> and reduce complexity

Not so. The spinlock-based rwsems are less complex.

> and icache footprint.

And smaller.

And the optimised rwsems don't necessarily have a larger footprint. After all,
given what Ingo has done and moved them completely out-of-line, means that the
compiler actually has to encode a call to them. Which means that some of the
values it currently has in registers are going to get clobbered, and it has to
arrange for whatever to be in the right registers. Now I grant you that this
is very much arch dependent, and mostly applies to i386 where only three
registers get clobbered by a call (which we can reduce to one).

> atomic_cmpxchg exists on all architectures.

But that doesn't mean you should blindly use it without consideration of the
consequences, or use it when there's a better way available.

> I'm happy to go with the spinlock version (it is even simpler), but I will
> have to benchmark that. I have seen small slowdowns there in high contention
> situations but it was improved with my rwsem scalability patch. If
> performance is no different between the two, then the spinlock version is a
> no brainer.

Using the spinlock-based rwsems in preference to XADD, for instance, is
potentially a source of greater contention, because (hopefully) XADD, in the
ideal case, will touch the cacheline containing the rwsem once, and that'll be
it, whereas with the spinlock version you have to touch the cacheline at least
four times (spinlock grab, examine data, modify data, spinlock drop), and that
means the cacheline can go elsewhere in between.

Using an emulated cmpxchg in preference to the spinlock-based rwsems, on the
other hand, may be as bad in the fast path because you may have to get an
implicit spinlock in the cmpxchg implementation, and in the slowpath it's worse
because not only do you have to get the rwsem-spinlock overall, but you also
have to get the cmpxchg-lock at regular intervals. So you've got two
cachelines potentially bouncing around.

Note that the same arguments apply to the use of just CMPXCHG (eg: sparc) or to
ST/LL or equivalent constructs (eg: mips, alpha, powerpc, arm6) because there's
a gap between the store and the load, and that gives an opportunity for the
cacheline to get stolen, and if that happens, there's a chance that the
instruction will have to be repeated.

As far as I can tell, XADD or direct equivalent is the only one where this
isn't true, but even that may be implemented by ST/LL-equivalents internally in
the CPU.

As I said before, if all you've got is CMPXCHG, you can actually do better than
either algorithm we currently have.

> my rwsem scalability patch

Ummm... I don't remember what that changed; can you send it to me again?


What is it you're trying to achieve anyway? Are you trying to do
generalisation for generalisation's sake? If so, consider that that might not
be the best thing, and that you might end up with something that rocks for one
or two archs and sucks for the rest.

Look at the genirq thing for an example of that... :-)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/