Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall

From: Steven Rostedt
Date: Tue Feb 28 2023 - 16:04:48 EST


On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:

> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
> > check_cpu_stall. The x86 CMPXCHG instruction returns success in the ZF
> > flag, so this change saves a compare after cmpxchg (and the related move
> > instruction in front of cmpxchg).
>
> In my codegen, I am not seeing the mov instruction before the cmp removed.
> How can that be? The rax has to be populated with a mov before cmpxchg, right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah, you got rid of the compare, but I am not seeing a reduction in moves.
> Either way, I think it is an improvement due to dropping the cmp, so:

Did you get the above backwards?
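
To recap at the C level, the conversion in question is the usual cmpxchg-loop
rewrite. A sketch with made-up names (ptr, old, new and SOME_FLAG are not from
the actual RCU or ring buffer code):

	/* Classic form: compare the returned value against "old" ourselves. */
	do {
		old = READ_ONCE(*ptr);
		new = old | SOME_FLAG;
	} while (cmpxchg(ptr, old, new) != old);

	/*
	 * try_cmpxchg form: on failure, cmpxchg already left the current
	 * value in eax/rax and try_cmpxchg writes it back into "old", while
	 * success is tested straight from ZF, so there is no trailing cmp.
	 * The mov that loads "old" (and thus rax) is still needed either
	 * way, which is why a mov remains in both listings below.
	 */
	old = READ_ONCE(*ptr);
	do {
		new = old | SOME_FLAG;
	} while (!try_cmpxchg(ptr, &old, new));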

Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:

0000000000000070 <ring_buffer_record_off>:
70: 48 8d 4f 08 lea 0x8(%rdi),%rcx
74: 8b 57 08 mov 0x8(%rdi),%edx
77: 89 d6 mov %edx,%esi
79: 89 d0 mov %edx,%eax
7b: 81 ce 00 00 10 00 or $0x100000,%esi
81: f0 0f b1 31 lock cmpxchg %esi,(%rcx)
85: 39 d0 cmp %edx,%eax
87: 75 eb jne 74 <ring_buffer_record_off+0x4>
89: e9 00 00 00 00 jmp 8e <ring_buffer_record_off+0x1e>
8a: R_X86_64_PLT32 __x86_return_thunk-0x4
8e: 66 90 xchg %ax,%ax


To

00000000000001a0 <ring_buffer_record_off>:
1a0: 8b 47 08 mov 0x8(%rdi),%eax
1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
1a7: 89 c2 mov %eax,%edx
1a9: 81 ca 00 00 10 00 or $0x100000,%edx
1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
1bb: 89 c2 mov %eax,%edx
1bd: 81 ca 00 00 10 00 or $0x100000,%edx
1c3: f0 0f b1 11 lock cmpxchg %edx,(%rcx)
1c7: 75 f2 jne 1bb <ring_buffer_record_off+0x1b>
1c9: e9 00 00 00 00 jmp 1ce <ring_buffer_record_off+0x2e>
1ca: R_X86_64_PLT32 __x86_return_thunk-0x4
1ce: 66 90 xchg %ax,%ax


It does add a bit more code, but the fast path (where the cmpxchg succeeds)
seems better. That path is:

00000000000001a0 <ring_buffer_record_off>:
1a0: 8b 47 08 mov 0x8(%rdi),%eax
1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
1a7: 89 c2 mov %eax,%edx
1a9: 81 ca 00 00 10 00 or $0x100000,%edx
1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
1b7: R_X86_64_PLT32 __x86_return_thunk-0x4


There the fast path has only two moves and no cmp, whereas the former has
three moves and a cmp.
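
From memory, the C for the two versions is roughly the following (a sketch,
not the exact patch; RB_BUFFER_OFF is the 0x100000 bit in the or above):

	void ring_buffer_record_off(struct trace_buffer *buffer)
	{
		int rd = atomic_read(&buffer->record_disabled);
		int new_rd;

		do {
			new_rd = rd | RB_BUFFER_OFF;
			/*
			 * On failure, rd is refreshed from the value that
			 * cmpxchg left in eax, so there is no reload.
			 */
		} while (!atomic_try_cmpxchg(&buffer->record_disabled,
					     &rd, new_rd));
	}

where the old version reloaded from memory on every iteration:

		do {
			rd = atomic_read(&buffer->record_disabled);
			new_rd = rd | RB_BUFFER_OFF;
		} while (atomic_cmpxchg(&buffer->record_disabled,
					rd, new_rd) != rd);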

-- Steve