Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

From: Pan Xinhui
Date: Thu Apr 21 2016 - 12:07:07 EST


On 2016-04-20 22:24, Peter Zijlstra wrote:
> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
>
>> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
>> +static __always_inline unsigned long \
>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
>> + unsigned long new); \
>> +static __always_inline u32 \
>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
>> +{ \
>> + int size = sizeof (type); \
>> + int off = (unsigned long)ptr % sizeof(u32); \
>> + volatile u32 *p = ptr - off; \
>> + int bitoff = BITOFF_CAL(size, off); \
>> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
>> + u32 oldv, newv, tmp; \
>> + u32 ret; \
>> + oldv = READ_ONCE(*p); \
>> + do { \
>> + ret = (oldv & bitmask) >> bitoff; \
>> + if (skip && ret != old) \
>> + break; \
>> + newv = (oldv & ~bitmask) | (new << bitoff); \
>> + tmp = oldv; \
>> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
>> + } while (tmp != oldv); \
>> + return ret; \
>> +}
>
> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
>
> Why did you choose to write it entirely in C?
>
Yes, you are right: the C version does more loads and stores.
However, xchg_u8/u16 is currently used only by qspinlock, and I did not see any performance regression there.
So I wrote it in C, for simplicity. :)
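
For readers who want to try the idea outside the kernel, here is a minimal userspace sketch of the same technique: a one-byte xchg/cmpxchg built on a 32-bit compare-and-swap loop. It is not the patch itself. It assumes C11 <stdatomic.h> in place of the kernel's __cmpxchg_u32, and the names byte_bitoff, xchg_u8_emulated and cmpxchg_u8_emulated are hypothetical, made up for this illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: bit position of the byte at memory offset `off`
 * (0..3) inside a u32, worked out from this machine's byte order. */
static unsigned int byte_bitoff(unsigned int off)
{
    const uint32_t probe = 1;
    int little_endian = *(const unsigned char *)&probe == 1;

    return (little_endian ? off : 3u - off) * 8u;
}

/* Hypothetical sketch: exchange one byte of *word using only a 32-bit
 * compare-and-swap. Returns the previous value of that byte. */
static uint8_t xchg_u8_emulated(_Atomic uint32_t *word, unsigned int off,
                                uint8_t newval)
{
    unsigned int bitoff = byte_bitoff(off);
    uint32_t mask = (uint32_t)0xff << bitoff;
    /* Plain load before the loop; the CAS below reads the word again. */
    uint32_t oldv = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t newv;

    do {
        /* Keep the other three bytes, replace only ours. */
        newv = (oldv & ~mask) | ((uint32_t)newval << bitoff);
        /* On failure, oldv is refreshed with the current word. */
    } while (!atomic_compare_exchange_weak(word, &oldv, newv));

    return (uint8_t)((oldv & mask) >> bitoff);
}

/* Hypothetical sketch: compare-and-exchange one byte. Returns the byte
 * value that was observed; the store happened only if it == expected. */
static uint8_t cmpxchg_u8_emulated(_Atomic uint32_t *word, unsigned int off,
                                   uint8_t expected, uint8_t newval)
{
    unsigned int bitoff = byte_bitoff(off);
    uint32_t mask = (uint32_t)0xff << bitoff;
    uint32_t oldv = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t newv;
    uint8_t cur;

    do {
        cur = (uint8_t)((oldv & mask) >> bitoff);
        if (cur != expected)
            return cur; /* mismatch: leave the word untouched */
        newv = (oldv & ~mask) | ((uint32_t)newval << bitoff);
    } while (!atomic_compare_exchange_weak(word, &oldv, newv));

    return cur;
}
```

As in the patch, a change to a neighbouring byte of the same word makes the 32-bit compare fail, so the loop simply retries with the refreshed value.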

Of course, I have run xchg tests.
We ran code like xchg((u8*)&v, j++); in several threads,
and the result is:
[ 768.374264] use time[1550072]ns in xchg_u8_asm
[ 768.377102] use time[2826802]ns in xchg_u8_c

I think this is because the C version does one more load.
If possible, we could move this code into asm-generic/.

thanks
xinhui