Re: [PATCH v2 07/36] x86/entry/64/compat: After SYSENTER, move STI after the NT fixup

From: Andy Lutomirski
Date: Fri Oct 09 2015 - 15:49:05 EST


On Wed, Oct 7, 2015 at 10:39 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
> On 10/06/2015 02:47 AM, Andy Lutomirski wrote:
>> We eventually want to make it all the way into C code before
>> enabling interrupts. We need to rework our flags handling slightly
>> to delay enabling interrupts.
>>
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> ---
>> arch/x86/entry/entry_64_compat.S | 30 ++++++++++++++++++++++--------
>> 1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
>> index aa76864a8a6b..1432d60a1f4a 100644
>> --- a/arch/x86/entry/entry_64_compat.S
>> +++ b/arch/x86/entry/entry_64_compat.S
>> @@ -58,14 +58,9 @@ ENDPROC(native_usergs_sysret32)
>> * with the int 0x80 path.
>> */
>> ENTRY(entry_SYSENTER_compat)
>> - /*
>> - * Interrupts are off on entry.
>> - * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>> - * it is too small to ever cause noticeable irq latency.
>> - */
>> + /* Interrupts are off on entry. */
>> SWAPGS_UNSAFE_STACK
>> movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>> - ENABLE_INTERRUPTS(CLBR_NONE)
>>
>> /* Zero-extending 32-bit regs, do not remove */
>> movl %ebp, %ebp
>> @@ -76,7 +71,16 @@ ENTRY(entry_SYSENTER_compat)
>> /* Construct struct pt_regs on stack */
>> pushq $__USER32_DS /* pt_regs->ss */
>> pushq %rbp /* pt_regs->sp */
>> - pushfq /* pt_regs->flags */
>> +
>> + /*
>> + * Push flags. This is nasty. First, interrupts are currently
>> + * off, but we need pt_regs->flags to have IF set. Second, even
>> + * if TF was set when SYSENTER started, it's clear by now. We fix
>> + * that later using TIF_SINGLESTEP.
>> + */
>> + pushfq /* pt_regs->flags (except IF = 0) */
>> + orl $X86_EFLAGS_IF, (%rsp) /* Fix saved flags */
>
> The sequence of "push + insn_using_rsp" is a bit slow
> on most CPUs because the stack engine (the machinery which makes
> consecutive pushes fast) needs synchronizing with the register file.
>
> It may be better to move the ORL insn here:
>
> push, push, push
> cld
> sub $(10*8), %rsp /* pt_regs->r8-11, bp, bx, r12-15 not saved */
> + orl $X86_EFLAGS_IF, EFLAGS(%rsp) /* Fix saved flags to have .IF = 1 */
>
> where we already eat that penalty.
>

I played with it a bit. It seems to save one cycle, which is probably
worthwhile.
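
For the curious, here's a rough userspace toy along the same lines (a
sketch, not literally what I measured; the selector values and the
80-byte "register hole" are just stand-ins for the real frame):

/*
 * Compare or-ing the saved flags right after the pushfq against an
 * orl deferred past a fake "sub to reserve pt_regs" adjustment.
 * Build with gcc -O2; the asm keeps rsp balanced within each block.
 */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtscp(void)
{
	uint32_t lo, hi;
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi) : : "rcx", "memory");
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtscp();
	for (i = 0; i < 100000000; i++)
		asm volatile("pushq $0x2b\n\t"		/* fake ss */
			     "pushq $0\n\t"		/* fake sp */
			     "pushfq\n\t"
			     "orl $0x200, (%%rsp)\n\t"	/* or right after the push */
			     "pushq $0x33\n\t"		/* fake cs */
			     "addq $32, %%rsp"
			     : : : "cc", "memory");
	t1 = rdtscp();
	for (i = 0; i < 100000000; i++)
		asm volatile("pushq $0x2b\n\t"
			     "pushq $0\n\t"
			     "pushfq\n\t"
			     "pushq $0x33\n\t"
			     "subq $80, %%rsp\n\t"	/* fake register hole */
			     "orl $0x200, 88(%%rsp)\n\t" /* deferred or */
			     "addq $112, %%rsp"
			     : : : "cc", "memory");
	t2 = rdtscp();

	printf("or early: %llu, or deferred: %llu\n",
	       (unsigned long long)(t1 - t0),
	       (unsigned long long)(t2 - t1));
	return 0;
}

The absolute numbers are noisy, but dividing the difference by the
iteration count gives roughly the per-entry delta.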

I think that the gain is small because neither case is particularly
bad. I read Agner Fog's writeup on the stack engine. I think that,
with the code as written in my patch, the µops are, roughly:

store $__USER32_DS to [rsp - 8]
store %rbp to [rsp - 16]
store flags to [rsp - 24]
sync rsp (rsp -= 24)
<-- wait for ALU here
or $X86_EFLAGS_IF, (%rsp) (might be two µops)
store $__USER32_CS to [rsp - 8]
...
sync rsp (rsp -= something)
<-- wait for ALU here
testl ...

with the change, it's:
store $__USER32_DS to [rsp - 8]
store %rbp to [rsp - 16]
store flags to [rsp - 24]
store $__USER32_CS to [rsp - 32]
...
sync rsp (rsp -= something)
sync rsp (for the explicit sub of $(10*8))
<-- wait for ALU here
or $X86_EFLAGS_IF, EFLAGS(%rsp) (might be two µops)
testl ...

So the change removes one of the ALU waits and lets the two rsp syncs
happen back to back. On the other hand, the ALU is probably otherwise
idle for most of this sequence, so the latency is probably almost
entirely hidden: the dispatcher can issue the rsp syncs much earlier
than their position in program order would suggest.

Also, deferring the orl makes the code longer and adds an address
calculation, and both of those may have some cost. And moving the orl
right before the testl lengthens the dependency chain feeding the
testl. It's too bad we don't have a three-operand orx on all
supported CPUs.
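
To put a number on "longer": assuming the usual pt_regs layout,
EFLAGS sits at offset 0x90, which is past disp8 range, so the
deferred form needs a four-byte displacement:

81 0c 24 00 02 00 00               orl $0x200, (%rsp)       /* 7 bytes */
81 8c 24 90 00 00 00 00 02 00 00   orl $0x200, 0x90(%rsp)   /* 11 bytes */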

Anyway, it's still a win, but I'll keep playing. There may be even
better ways to do this.

--Andy