Re: [PATCH] convert cnt32_to_63 to inline

From: Mathieu Desnoyers
Date: Tue Nov 11 2008 - 15:11:46 EST


* Russell King (rmk+lkml@xxxxxxxxxxxxxxxx) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it gives once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> > arm versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which are needless if gcc was more inteligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>

I think the added barrier() are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier because gcc cannot assume it won't be modified. However, to
insure that OSCR read is done after __m_cnt_hi read, this barrier seems
required to be safe against gcc optimizations.

Have you compared my patch to Nicolas'patch, which adds a smp_rmb() in
the macro or to a vanilla tree ?

I wonder if reading those values sooner would help gcc ?

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/