Re: [PATCH] x86: slightly shorten __ticket_spin_trylock() (v2)

From: Jan Beulich
Date: Fri Dec 11 2009 - 10:32:35 EST


>>> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> 03.12.09 22:10 >>>
>On Thu, 3 Dec 2009, Jan Beulich wrote:
>> @@ -88,12 +88,11 @@ static __always_inline int __ticket_spin
>> LOCK_PREFIX "cmpxchgw %w1,%2\n\t"
>> "1:"
>> "sete %b1\n\t"
>> - "movzbl %b1,%0\n\t"
>> : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
>> :
>> : "memory", "cc");
>>
>> - return tmp;
>> + return new;
>
>This is fairly pessimal register allocation.
>
>It used to be that we returned the value in 'tmp', which is %eax, which is
>also the expected return register. Now that we use 'new', it's some random
>other register that is _not_ %eax, which means that while we avoid a
>'movzbl', the regular spin_trylock function call case will now have the
>compiler emitting a
>
> movb %X,%al
> ret
>
>at the end just to get the right return register.

No, not really, because __spin_trylock() doesn't directly pass on the
return value of _raw_spin_trylock().

>Which seems a bit annoying. It looks like we could have done that last
>instruction as just
>
> sete %b0
>
>instead, and then return 'tmp' instead of 'new', keeping the return value
>in %al and avoiding the unnecessary movement.
>
>> -static __always_inline int __ticket_spin_trylock(raw_spinlock_t *lock)
>> +static __always_inline u8 __ticket_spin_trylock(raw_spinlock_t *lock)
>
>Same thing here, afaik.
>
>NOTE! I haven't actually looked at the generated code, and if we actually
>inline it all the way to the caller, it won't matter (and picking another
>register may even help). But while these helpers are marked
>__always_inline, I _thought_ that the way we actually build the final
>'spin_trylock()' function we end up with a real function in the end.

The code isn't really much different either way: In the old variant, the
compiler generated a xor/set pair, in the variant you suggest it
produces a single movzx. In the _spin_trylock_bh() case the new
variant actually produces an extra instruction (copying from %eax to
another register), but all that certainly also depends on the compiler
version.

Hence I'm really uncertain which of both methods is preferable. The
one additional benefit to the version you suggest is that it permits
relaxing the constraint of 'new' from q to r (permitting the compiler
to pick from a wider set of registers on 32-bit). But since there are
only very few direct callers of _raw_spin_trylock(), this is marginal,
as it doesn't matter for _spin_trylock() (I don't think the lock
debugging case really needs much attention on performance).

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/