Re: [RFC 0/7] Prep code for better stack switching

From: Andy Lutomirski
Date: Sun Nov 12 2017 - 23:38:24 EST


On Sat, Nov 11, 2017 at 8:25 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Sat, Nov 11, 2017 at 6:59 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> On Sat, Nov 11, 2017 at 2:58 AM, Borislav Petkov <bp@xxxxxxx> wrote:
>>> On Fri, Nov 10, 2017 at 08:05:19PM -0800, Andy Lutomirski wrote:
>>>> This isn't quite done (the TSS remap patch is busted on 32-bit, but
>>>> that's a straightforward fix), but it should be ready for at least a
>>>> conceptual review.
>>>>
>>>> The idea here is to prepare us to have all kernel data needed for
>>>> user mode execution and early entry located in the fixmap. To do
>>>> this, I hijack the GDT remap mechanism and make it more general. I
>>>> add a struct cpu_entry_area. This struct is never instantiated
>>>> directly. Instead, it represents the layout of a per-cpu portion of
>>>> the fixmap. That portion contains the GDT, the TSS (including IO
>>>> bitmap), and the entry stack (for now just a part of the TSS
>>>> region). It should also end up containing the PEBS and BTS buffers.
>>>>
>>>> If this works, then the idea would be to add a magic *executable* page
>>>> to cpu_entry_area. That page would contain a stub like this:
>>>>
>>>> ENTRY(entry_SYSCALL_64_trampoline)
>>>>         UNWIND_HINT_EMPTY
>>>>         movq %rsp, 0x1000+entry_SYSCALL_64_trampoline-1f(%rip)
>>>> 1:
>>>>         movq 0x1008+entry_SYSCALL_64_trampoline-1f(%rip), %rsp
>>>> 1:
>>>>         pushq %rdi
>>>>         pushq %rsi
>>>
>>>>         movq 0x1000+entry_SYSCALL_64_trampoline-1f(%rip), %rsi
>>>> 1:
>>>>         movq $entry_SYSCALL_64, %rdi
>>>>         jmp *%rdi
>>>
>>> So I'm wondering: r12-r15 are callee-preserved so why can't you
>>> scratch into those on entry and leave rsi and rdi pristine so that
>>> entry_SYSCALL_64 can get to work directly?
>>
>> I'm not sure I understand your suggestion. SYSCALL has always
>> preserved all regs except rcx, r11, flags, rax, and, depending on what
>> signals are involved, the argument registers. r12-r15 are definitely
>> preserved, and existing userspace relies on that.
>>
>> Anyway, I'm halfway through actually implementing this, and it looks a
>> wee bit different, but not much different.
>
>
> Here it is:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack.wip&id=96a6ab74088a86f6b9b6df8284c6466e4fa50d08
>
> Seems to work for me.
>
> Dave, want to see if you can get this working cleanly without mapping
> any percpu variables at all? You'll probably have to move PEBS, etc.
> into cpu_entry_area. For now, it should be safe to just ignore the
> LDT. I'm somewhat tempted to just adjust your code so that the fixmap
> ends up being mapped separately for LDT-using tasks rather than
> mucking with putting the LDT in the user address range. The latter
> involves a little more mm magic than I really want to deal with if I
> can avoid it.

If any of you are playing with the full series (the stuff in my tree,
not the stuff in this email), don't try to use it with excessive
amounts of tracing on or with CONFIG_CONTEXT_TRACKING_FORCE -- it'll
explode horribly. I see the root cause, and I'll fix it soon.
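
For anyone trying to picture the cpu_entry_area idea quoted above, here
is a minimal userspace C sketch of a struct that is never instantiated
and only describes the layout of a per-cpu slice of a page-indexed
region. Everything named demo_* is hypothetical, the page counts are
made up, and the ascending index arithmetic is an assumption for
illustration; none of it is taken from the actual series.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical stand-ins; the sizes are illustrative, not the kernel's. */
struct demo_gdt_page {
	unsigned char gdt[DEMO_PAGE_SIZE];
};

struct demo_tss_pages {
	/* TSS plus IO bitmap, with the entry stack carved out of it. */
	unsigned char tss_and_io_bitmap[3 * DEMO_PAGE_SIZE];
};

/*
 * Layout description only: no object of this type is ever created.
 * It exists so that offsetof() and sizeof() yield page offsets.
 */
struct demo_cpu_entry_area {
	struct demo_gdt_page gdt;
	struct demo_tss_pages tss;
};

#define DEMO_AREA_PAGES \
	(sizeof(struct demo_cpu_entry_area) / DEMO_PAGE_SIZE)

/*
 * Page index of a field for a given CPU, assuming (for this sketch)
 * that per-cpu areas are laid out back to back in ascending order.
 */
static size_t demo_page_index(unsigned int cpu, size_t offset)
{
	assert(offset < sizeof(struct demo_cpu_entry_area));
	return cpu * DEMO_AREA_PAGES + offset / DEMO_PAGE_SIZE;
}
```

With these made-up sizes each CPU's area spans four pages, so
demo_page_index(2, offsetof(struct demo_cpu_entry_area, tss)) lands on
page 9 of the region. The point of describing the layout as a struct is
that adding a member (say, a PEBS buffer) shifts every later offset
automatically instead of requiring hand-maintained constants.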