Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

From: Uros Bizjak
Date: Mon Jun 09 2025 - 02:04:57 EST


On Sun, Jun 8, 2025 at 10:51 PM David Laight
<david.laight.linux@xxxxxxxxx> wrote:
>
> On Fri, 6 Jun 2025 09:27:07 +0200
> Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > > gcc is over-eager to use rep movsq/stosq (it starts above 40 bytes),
> > > > which comes with a significant penalty on CPUs without the respective
> > > > fast short ops bits (FSRM/FSRS).
> > >
> > > I don't suppose there's a magic compiler toggle to make it emit prefix
> > > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > > right?
> > >
> > > Something like:
> > >
> > > 2e 2e 2e f3 a4 cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> > >
> > > because if we can get the compilers to do this, then I can get objtool
> > > to collect all these locations and then we can runtime patch them to be:
> > >
> > > call rep_movs_alternative / rep_stos_alternative
> > >
> > > or whatever other crap we want really.
> >
> > BTW: You can achieve the same effect by using -mstringop-strategy=libcall
> >
> > Please consider the following testcase:
> >
> > --cut here--
> > struct a { int r[40]; };
> > struct a foo (struct a b) { return b; }
> > --cut here--
> >
> > By default, the compiler emits an SSE copy (-O2):
> >
> > foo:
> > .LFB0:
> > .cfi_startproc
> > movdqu 8(%rsp), %xmm0
> > movq %rdi, %rax
> > movups %xmm0, (%rdi)
> > movdqu 24(%rsp), %xmm0
> > movups %xmm0, 16(%rdi)
> > ...
> > movdqu 152(%rsp), %xmm0
> > movups %xmm0, 144(%rdi)
> > ret
> >
> > but the kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):
> >
> > foo:
> > movq 8(%rsp), %rdx
> > movq %rdi, %rax
> > leaq 8(%rdi), %rdi
> > leaq 8(%rsp), %rsi
> > movq %rax, %rcx
> > movq %rdx, -8(%rdi)
> > movq 160(%rsp), %rdx
> > movq %rdx, 144(%rdi)
> > andq $-8, %rdi
> > subq %rdi, %rcx
> > subq %rcx, %rsi
> > addl $160, %ecx
> > shrl $3, %ecx
> > rep movsq
> > ret
> >
> > Please note the code that aligns pointers before "rep movsq".
>
> Do you ever want it?
> From what I remember of benchmarking 'rep movsb', even on Ivy Bridge the
> alignment makes almost no difference to throughput.

Please note that the instruction is "rep movsQ"; it moves 64-bit
quantities. The alignment is needed to bring the data to a 64-bit
boundary.
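
To spell out what that alignment code does (my annotation, not compiler
output; register roles as in the -mno-sse listing above, with the
160-byte struct starting at 8(%rsp)):

        movq    %rdx, -8(%rdi)   # plain (possibly unaligned) store of the first 8 bytes
        movq    160(%rsp), %rdx
        movq    %rdx, 144(%rdi)  # plain store of the last 8 bytes
        andq    $-8, %rdi        # round dst+8 down to an 8-byte boundary
        subq    %rdi, %rcx       # rcx = original dst - aligned dst
        subq    %rcx, %rsi       # advance src by the same distance dst advanced
        addl    $160, %ecx       # bytes left from the aligned point onwards ...
        shrl    $3, %ecx         # ... converted to a qword count
        rep movsq                # copy the aligned middle 8 bytes at a time

The head and tail are copied with ordinary movq, so "rep movsq" itself
only ever sees an 8-byte-aligned destination.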

Uros.
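
P.S.: I have not rerun this for the mail, but with the libcall strategy
mentioned above (-O2 -mno-sse -mstringop-strategy=libcall) the whole
copy should turn into an ordinary memcpy call, roughly:

foo:
        leaq    8(%rsp), %rsi
        movl    $160, %edx
        jmp     memcpy

(it may be a regular call rather than a tail call, and the exact
register scheduling differs between GCC versions). The point is that
the inline string op becomes a call site that objtool can find and
patch like any other.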