Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
From: David Laight
Date: Mon Jun 09 2025 - 17:19:28 EST
On Thu, 5 Jun 2025 18:47:33 +0200
Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> gcc is overeager to use rep movsq/stosq (it starts above 40 bytes), which
> comes with a significant penalty on CPUs without the respective fast
> short ops bits (FSRM/FSRS).
>
> Another point is that even uarchs with FSRM don't necessarily have FSRS (Ice
> Lake and Sapphire Rapids don't).
>
> More importantly, rep movsq is not fast even if FSRM is present.
Which architecture is that?
I got exactly the same timings for 'rep movsb' and 'rep movsq' when
I ran some tests on Intel CPUs going back to Ivy Bridge.
I do need to redo them though. I've since worked out how to time them
without using mfence/lfence, which should give a reasonable
estimate of the setup cost.
(I can measure the data-dependency of a single divide...)
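For anyone wanting to reproduce the comparison, a rough sketch follows.
It is not the test code referred to above: the helper names
(copy_rep_movsb, copy_rep_movsq, min_cycles) are made up for
illustration, it assumes x86-64 with GCC or Clang, and it reads the TSC
with plain unserialized rdtsc, so the numbers include rdtsc overhead and
are in reference cycles rather than core cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Copy len bytes with 'rep movsb' (hypothetical helper, x86-64 only). */
static inline void copy_rep_movsb(void *dst, const void *src, size_t len)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len)
                     :
                     : "memory");
}

/* Copy len bytes with 'rep movsq'; len must be a multiple of 8. */
static inline void copy_rep_movsq(void *dst, const void *src, size_t len)
{
    size_t qwords = len / 8;

    assert(len % 8 == 0);
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
}

typedef void (*copy_fn)(void *, const void *, size_t);

/*
 * Smallest TSC delta seen over 'iters' runs of fn.
 * Taking the minimum filters out interrupts and cache misses, but the
 * result still contains the cost of the two rdtsc instructions and is
 * not serialized against the surrounding code.
 */
static uint64_t min_cycles(copy_fn fn, void *dst, const void *src,
                           size_t len, int iters)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < iters; i++) {
        uint64_t t0 = __rdtsc();
        fn(dst, src, len);
        uint64_t t1 = __rdtsc();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling min_cycles() with each helper on the same buffers and the same
length (48 bytes, say, just above the threshold mentioned in the patch
description) gives two figures to compare. Treat any difference of a few
cycles as noise with this method.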
David