Re: queued spinlock code and results

From: Davide Libenzi
Date: Mon Jul 09 2007 - 15:47:18 EST


On Mon, 9 Jul 2007, Linus Torvalds wrote:

> On Mon, 9 Jul 2007, Davide Libenzi wrote:
> >
> > So in this box, and in this test, the double-short Z-lock seems faster
> > than a double-byte. I've no idea why, since it uses two ops more and an
> > extra register.
>
> At this kind of level, the exact instruction scheduling can make a big
> difference.
>
> The extra register usage won't matter if there is no register pressure,
> and any extra instructions can actually happen to *help*, if they end up
> just aligning something just the right way.
>
> There can also be various random effects of prefixes: decoding x86
> instructions is basically a very uarch-specific issue, and for all we know
> it might be that the AMD setup may well end up behaving differently from
> most Intel chips (and within the Intel family, the netburst situation is
> likely different from the other P6-derived cores).
>
> For example, does a single prefix decode faster? It could be that the
> combination of "lock" _and_ "opsize" prefixes is problematic (as in a
> 16-bit locked "lock xaddw"), and causes a decode hickup, but that "lock"
> and "opsize" on their own don't cause any decoder issues (ie doing the
> "lock" on the 32-bit xadd, and just the "opsize" prefix on the 16-bit decw
> both are fast).
>
> But on another uarch it might work out the other way: if "lock" is always
> a complex op, then having a opsize prefix on that one might be "free", and
> then you're better combining them for the locked 16-bit xadd, and having
> the releasing "decb" not have any prefix at all.
>
> And regardless of that, just a random "it happened to get aligned that
> way" (where "alignment" might be about hitting the cache-line just right,
> but might also be about just having the right instruction mix to get the
> intel decoders to run at their full 4-1-1-1 capacity), causing the timing
> differences.
>
> So before taking these numbers as any kind of "real" values, I'd suggest:
>
> - trying it out on at least a few different uarchs (Opteron, P4 and Core
> 2 all have quite different restrictions on decoding)
>
> - possibly trying it out with things in different order and different
> compiler options (-O2 vs -Os), trying to cause different kinds of
> alignment issues.
>
> Also, just a small nit: in the kernel, the locking would _not_ be inlined
> (but the unlocking would), so marking the lock functions "inline" is
> probably a bad idea. Without the inline, it's likely more realistic, and
> the effects of register pressure will be hidden. Because of the uninlining
> nature of locks, I think you can generally ignore the "one or two
> registers" issue - you'll have three caller-clobbered registers to play
> with regardless.

Indeed, with no inline, on a P4 (with -O2), numbers betweeen xadd-lock
and zadd-lock gets closer:

inc-lock in cache takes 35.15ns
xadd-lock in cache takes 43.84ns
vadd-lock in cache takes 53.51ns
zadd-lock in cache takes 43.28ns
inc-lock out of cache takes 122.92ns
xadd-lock out of cache takes 126.98ns
vadd-lock out of cache takes 172.36ns
zadd-lock out of cache takes 126.01ns

The always-lfence instruction in vadd-lock really is painfull though.
If numbers are close, and given that spinlock size considering structure
alignments should not matter much, wouldn't it be better to use a double
short and remove the 256 CPUs cap?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/