Re: [RFC 00/30] x86: Rewrite all syscall entries except native 64-bit

From: Brian Gerst
Date: Thu Sep 03 2015 - 01:23:39 EST


On Tue, Sep 1, 2015 at 6:41 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> Here's a monster series that I'm working on. I think it's in decent
> shape now.
>
> The first couple patches are tests and some old stuff. There's a
> test that validates the vDSO AT_SYSINFO annotations (which fails on
> 32-bit Debian systems for some reason that I can't yet fathom
> because fast syscalls simply don't happen on my VM for unknown
> reasons presumably related to glibc bugs or misconfiguration, and I
> need to do something about the test). There's also a test that
> exercises some assumptions that signal handling and ptracers make
> about syscalls that currently do *not* hold on 64-bit AMD using
> 32-bit AT_SYSINFO.
>
> The next few patches are the NT stuff. Ingo, feel free to pretend
> you don't see it until the merge window closes :)
>
> The rest is basically a rewrite of syscalls for all cases except
> 64-bit native. With these patches applied, there is a single 32-bit
> vDSO and it uses SYSCALL, SYSENTER, and INT80 almost interchangeably
> via alternatives. The semantics of SYSENTER and SYSCALL are defined
> as:
>
> 1. If SYSCALL, ESP = ECX
> 2. ECX = *ESP
> 3. IP = INT80 landing pad
> 4. Opportunistic SYSRET/SYSEXIT is enabled on return
>
> The vDSO is rearranged so that these semantics work. Anything that
> backs IP up by 2 ends up pointing at a bona fide int $0x80
> instruction with the expected regs.
>
> In the process, the vDSO CFI annotations (which are actually used)
> get rewritten using normal CFI directives.
>
> Opportunistic SYSRET/SYSEXIT only happens on return when CS and SS
> are as expected, IP points to the INT80 landing pad, and flags are
> in good shape.

I think the opportunistic exit code could be improved a bit more. The
checks are only necessary if force_iret() was called, meaning
registers were changed. One possibility is to add a ti->status flag,
TS_FASTSYSCALL. Then we could move the tests to force_iret(), which
would clear the flag if the registers fail validation. The syscall
exit path would then check the flag and exit via IRET if it's clear.
That would reduce the impact of the tests on the fast path where no
regs were changed.
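
A rough userspace model of that idea, not kernel code. Every name
prefixed with model_ or MODEL_ is hypothetical, the selector and
landing-pad values are placeholders, and treating TF/RF as the "bad"
flags is an assumption. It also assumes the fast syscall entry is what
sets TS_FASTSYSCALL, which the proposal above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ti->status bit: a fast exit is still permitted. */
#define TS_FASTSYSCALL    0x1u

/* Placeholder values, assumed for this model only. */
#define MODEL_USER_CS     0x23u
#define MODEL_USER_SS     0x2bu
#define MODEL_LANDING_PAD 0x1000u
#define MODEL_BAD_FLAGS   ((1u << 8) | (1u << 16))	/* TF and RF */

struct model_regs {
	uint32_t ip, cs, ss, flags;
};

struct model_thread_info {
	uint32_t status;
	struct model_regs regs;
};

enum model_exit { MODEL_EXIT_FAST, MODEL_EXIT_IRET };

/* The validation that would otherwise run on every syscall exit. */
static bool model_regs_ok(const struct model_regs *r)
{
	return r->cs == MODEL_USER_CS && r->ss == MODEL_USER_SS &&
	       r->ip == MODEL_LANDING_PAD &&
	       (r->flags & MODEL_BAD_FLAGS) == 0;
}

/* Assumed: SYSENTER/SYSCALL entry marks the task as fast-exit capable. */
static void model_fast_entry(struct model_thread_info *ti)
{
	assert(ti != NULL);
	ti->status |= TS_FASTSYSCALL;
}

/* Called by whoever modified the saved registers (ptrace, signals, ...). */
static void model_force_iret(struct model_thread_info *ti)
{
	assert(ti != NULL);
	if (!model_regs_ok(&ti->regs))
		ti->status &= ~TS_FASTSYSCALL;
}

/* Exit path: one flag test instead of the full register validation. */
static enum model_exit model_syscall_exit(struct model_thread_info *ti)
{
	enum model_exit kind;

	assert(ti != NULL);
	kind = (ti->status & TS_FASTSYSCALL) ? MODEL_EXIT_FAST
					     : MODEL_EXIT_IRET;
	ti->status &= ~TS_FASTSYSCALL;	/* re-armed by the next fast entry */
	return kind;
}
```

With this shape, a syscall that never touches the saved regs pays for
a single status test on exit. The full validation runs only when
model_force_iret() is reached.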

> Other than that, the system call entries are simplified to the bare
> minimum prologue and a call to a C function. Amusingly, SYSENTER
> and SYSCALL32 use the same C function.
>
> To make that work, I had to remove all the 32-bit syscall stubs
> except the clone argument hack. This is because, for C code to call
> through the system call table, the system call table entries need to
> be real function pointers with C-compatible ABIs.
>
> There is nothing at all anymore that requires that x86_32 syscalls
> be asmlinkage. That could be removed in a subsequent patch.

Other arches (at least IA-64) still need asmlinkage or something
equivalent for their syscalls.

asmlinkage_protect() can also be removed.

> The upshot appears to be a ~25 cycle performance hit on 32-bit fast
> path syscalls. The slow path is probably faster under most
> circumstances and, if the exit slow path gets hit, it'll be much
> faster because (as we already do in the 64-bit native case) we can
> still use SYSEXIT/SYSRET.
>
> The patchset is structured as a removal of the old fast syscall
> code, then the change that makes syscalls into real functions, then
> a clean re-implementation of fast syscalls.
>
> If we want some of the 25 cycles back, we could consider open-coding
> a new C fast path.

Is the 25-cycle hit for the compat or the native case? I'd expect the
native case to be hit harder because of register pressure.

--
Brian Gerst