Re: RFC: Petition Intel/AMD to add POPF_IF insn

From: Linus Torvalds
Date: Wed Aug 17 2016 - 15:54:10 EST


On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing.

That's basically the overhead of "oops, I need to use the microcode sequencer".

One issue is that the Intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop sequencer logic.

According to Agner Fog's tables, there's one or two
micro-architectures that actually do the simple "popf" case with a
single cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" into the three or four uops that
the "complex decoder" can generate directly.

And that "fall back to the uop sequencer engine" tends to just always
cause several cycles regardless. So yes, microcode tends to be slow
even for what would otherwise be trivial operations. You'd think Intel
could do as well as they do for the L0 uop cache, but afaik they
don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes. It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes).
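
For reference, a rough user-space sketch of timing the easy case (popf
restoring an unchanged flags value) might look like the following. It
assumes x86-64 with GCC or Clang inline asm; time_pushf_popf() is a
hypothetical helper name, not something from this thread. rdtsc counts
reference cycles rather than core clocks, and the result includes the
pushf and the loop overhead, so treat the number as a ballpark only.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/*
 * Hypothetical helper: returns the TSC ticks spent on `iters`
 * pushfq/popfq pairs. popfq reloads the value pushfq just saved,
 * so no flag actually changes.
 */
static uint64_t time_pushf_popf(unsigned iters)
{
    uint64_t start = __rdtsc();

    for (unsigned i = 0; i < iters; i++) {
        __asm__ volatile(
            /* Step over the 128-byte red zone so the push cannot
             * overwrite compiler-owned stack slots. lea leaves the
             * flags untouched, unlike sub/add. */
            "lea -128(%%rsp), %%rsp\n\t"
            "pushfq\n\t"
            "popfq\n\t"
            "lea 128(%%rsp), %%rsp"
            ::: "memory", "cc");
    }

    return __rdtsc() - start;
}
```

Dividing the return value by iters gives a per-pair estimate. This
cannot measure the IF-changing case: at CPL 3 with IOPL 0, popf leaves
IF alone without faulting, so that timing has to be done in the kernel.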

It used to be more like a hundred cycles on Prescott.

Linus