Re: RFC: Petition Intel/AMD to add POPF_IF insn

From: Denys Vlasenko
Date: Thu Aug 18 2016 - 09:26:26 EST


> Of course, somebody really should do timings on modern CPU's (in cpl0,
> comparing native_fl() that enables interrupts with a popf)

I haven't done CPL0 tests yet, but I realized that CLI/STI can be tested
in userspace if we call iopl(3) first.

Surprisingly, STI is slower than CLI. A loop with 27 CLIs and one STI
converges to about 0.5 insn/cycle:

# compile with: gcc -nostartfiles -nostdlib
_start: .globl _start
        mov     $172, %eax              # iopl
        mov     $3, %edi
        syscall
        mov     $200*1000*1000, %eax    # loop counter
        .balign 64
loop:
        cli;cli;cli;cli
        cli;cli;cli;cli
        cli;cli;cli;cli
        cli;cli;cli;cli

        cli;cli;cli;cli
        cli;cli;cli;cli
        cli;cli;cli;sti
        dec     %eax
        jnz     loop

        mov     $231, %eax              # exit_group
        syscall

perf stat:
6,015,787,968 instructions # 0.52 insn per cycle
3.355474199 seconds time elapsed

With all CLIs replaced by STIs, it's ~0.25 insn/cycle:

6,030,530,328 instructions # 0.27 insn per cycle
6.547200322 seconds time elapsed


POPF which needs to enable interrupts is not measurably faster than
one which does not change .IF:

Loop with:
  400158:  fa    cli
  400159:  53    push %rbx    # saved EFLAGS with IF=1
  40015a:  9d    popfq
shows:
8,908,857,324 instructions # 0.11 insn per cycle ( +- 0.00% )

Loop with:
400140: fb sti
400141: 53 push %rbx
400142: 9d popfq
shows:
8,920,243,701 instructions # 0.10 insn per cycle ( +- 0.01% )

Even a loop with neither CLI nor STI, only PUSH and POPF:
400140: 53 push %rbx
400141: 9d popfq
shows:
6,079,936,714 instructions # 0.10 insn per cycle ( +- 0.00% )

This is on a Skylake CPU.


The gist of it:
CLI is 2 cycles,
STI is 4 cycles,
POPF is 10 cycles
seemingly regardless of prior value of EFLAGS.IF.