Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

From: Andy Lutomirski
Date: Tue Mar 10 2015 - 09:27:18 EST


On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>
>> > So there are now +2 instructions (5 instead of 3) in the
>> > system_call path, but there are -2 instructions in the SYSRETQ
>> > path,
>>
>> Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that none
> of the changelogs spelled out the increase in overhead ...
>
>> [...] There is only this change in SYSRETQ path, which simply
>> changes where we get RSP from:
>>
>> @@ -293,7 +289,7 @@ ret_from_sys_call:
>> CFI_REGISTER rip,rcx
>> movq EFLAGS(%rsp),%r11
>> /*CFI_REGISTER rflags,r11*/
>> - movq PER_CPU_VAR(old_rsp), %rsp
>> + movq RSP(%rsp),%rsp
>> /*
>> * 64bit SYSRET restores rip from rcx,
>> * rflags from r11 (but RF and VM bits are forced to 0),
>>
>> Most likely, no change in execution speed here.
>> At best, it is one cycle faster somewhere in address generation unit
>> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
>>
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up - but making it slower without really good reasons
> isn't good.
>
> Is 'usersp' really that much of a complication?

usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely. I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov? I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there. If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point). So this change could actually be a step
toward getting faster.

--Andy

>
> Thanks,
>
> Ingo



--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/