Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall

From: Paul E. McKenney
Date: Tue Feb 28 2023 - 16:29:17 EST


On Tue, Feb 28, 2023 at 04:03:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 20:39:30 +0000
> Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> > On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > > check_cpu_stall. The x86 CMPXCHG instruction returns success in the ZF
> > > flag, so this change saves a compare after the cmpxchg (and the related
> > > move instruction in front of the cmpxchg).
> >
> > In my codegen, I am not seeing the mov instruction before the cmpxchg
> > removed; how can that be? %rax has to be populated with a mov before
> > cmpxchg, right?
> >
> > So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> > Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> >
> > So yeah, you got rid of the compare, but I am not seeing a reduction in
> > moves. Either way, I think it is an improvement due to dropping the cmp, so:
>
> Did you get the above backwards?
>
> Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
> Uros sent to me for the ring buffer, the code went from:
>
> 0000000000000070 <ring_buffer_record_off>:
> 70: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 74: 8b 57 08 mov 0x8(%rdi),%edx
> 77: 89 d6 mov %edx,%esi
> 79: 89 d0 mov %edx,%eax
> 7b: 81 ce 00 00 10 00 or $0x100000,%esi
> 81: f0 0f b1 31 lock cmpxchg %esi,(%rcx)
> 85: 39 d0 cmp %edx,%eax
> 87: 75 eb jne 74 <ring_buffer_record_off+0x4>
> 89: e9 00 00 00 00 jmp 8e <ring_buffer_record_off+0x1e>
> 8a: R_X86_64_PLT32 __x86_return_thunk-0x4
> 8e: 66 90 xchg %ax,%ax
>
>
> To
>
> 00000000000001a0 <ring_buffer_record_off>:
> 1a0: 8b 47 08 mov 0x8(%rdi),%eax
> 1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 1a7: 89 c2 mov %eax,%edx
> 1a9: 81 ca 00 00 10 00 or $0x100000,%edx
> 1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
> 1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
> 1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
> 1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
> 1bb: 89 c2 mov %eax,%edx
> 1bd: 81 ca 00 00 10 00 or $0x100000,%edx
> 1c3: f0 0f b1 11 lock cmpxchg %edx,(%rcx)
> 1c7: 75 f2 jne 1bb <ring_buffer_record_off+0x1b>
> 1c9: e9 00 00 00 00 jmp 1ce <ring_buffer_record_off+0x2e>
> 1ca: R_X86_64_PLT32 __x86_return_thunk-0x4
> 1ce: 66 90 xchg %ax,%ax
>
>
> It does add a bit more code, but the fast path seems better (where the
> cmpxchg succeeds). That would be:
>
> 00000000000001a0 <ring_buffer_record_off>:
> 1a0: 8b 47 08 mov 0x8(%rdi),%eax
> 1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 1a7: 89 c2 mov %eax,%edx
> 1a9: 81 ca 00 00 10 00 or $0x100000,%edx
> 1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
> 1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
> 1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
> 1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
>
>
> Where there are only two moves and no cmp, whereas the former has three
> moves and a cmp in the fast path.

All well and good, but the stall-warning code is nowhere near a fastpath.

Is try_cmpxchg() considered more readable in this context?

Thanx, Paul