Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

From: Andy Lutomirski
Date: Sun Mar 08 2015 - 09:59:42 EST


On Mar 8, 2015 4:55 AM, "Ingo Molnar" <mingo@xxxxxxxxxx> wrote:
>
>
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> > >
> > > We could save the same 10 cycles from page fault overhead as well,
> > > AFAICS.
> >
> > Are trap gates actually noticeably faster? Or is it just the
> > "conditional_sti()" you're worried about?
>
> ( I'll talk about the CR2 complication later, please ignore that
> problem for a moment. )
>
> So I base my thinking on the following hierarchy of fast paths. In a
> typical x86 system there are 4 main types of 'context switches', in
> order of decreasing frequency:
>
> - syscall 'context switch': entry/exit from syscall
> - trap/fault 'context switch': entry/exit from trap/fault
> - irq 'context switch': entry/exit from irqs
> - task 'context switch': scheduler context-switch
>
> Where each successive level is about an order of magnitude less
> frequently executed on a typical system than the previous one, so to
> optimize a level we are willing to push overhead to the next one(s).
>
> So the primary payoff in executing much of the entry code with irqs
> enabled would be to allow 64-bit *syscalls* to be made without irqs
> disabled: the first, most important level of context entries.
>
> Doing that would give us four (theoretical) performance advantages:
>
> - No implicit irq disabling overhead when the syscall instruction is
> executed: we could drop X86_EFLAGS_IF from the MSR_SYSCALL_MASK
> flags mask, which removes the implicit CLI on syscall entry (see
> the sketch after this list).
>
> - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
> system_call.
>
> - No explicit irq disabling overhead in the ret_from_sys_call fast
> path, i.e. no DISABLE_INTERRUPTS() [CLI].
>
> - No implicit irq enabling overhead in ret_from_sys_call's
> USERGS_SYSRET64: the SYSRETQ instruction would not have to
> re-enable irqs, as the user-space IF restored from R11 would
> already match the current IF.
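>
> A minimal sketch of the first item (illustrative only; the companion
> mask bits here are assumed, not copied from syscall_init()):
>
>   #include <asm/msr.h>
>   #include <asm/processor-flags.h>
>
>   /*
>    * Bits set in MSR_SYSCALL_MASK (IA32_FMASK) are cleared in RFLAGS
>    * by the SYSCALL instruction.  Leaving X86_EFLAGS_IF out of the
>    * mask is what removes the implicit CLI on entry.
>    */
>   static void __init syscall_mask_sketch(void)
>   {
>           wrmsrl(MSR_SYSCALL_MASK,
>                  X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_IOPL |
>                  X86_EFLAGS_AC | X86_EFLAGS_NT);
>                  /* X86_EFLAGS_IF intentionally omitted */
>   }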
>
> Whether that's an actual performance win in practice as well needs to
> be measured, but I'd be (very!) shocked if it weren't in the 20+ cycle
> range: which is absolutely huge in terms of system_call optimizations.
>
> Now do I think this could be done realistically? I think yes (by
> re-using the NMI code's paranoid entry codepaths for the irq code as
> well, essentially fixing up the effects of any partial entries in an
> irq entry slow path), but I could be wrong about that.
>
> My main worry isn't even feasibility but maintainability and general
> robustness: all these asm cleanups we are doing now could enable such
> a trick to be pulled off robustly.
>
> But I think it could be done technically, because the NMI code already
> has to be careful about 'preempting' partial entries, so we have the
> logic.
>
> Complications:
>
> - We now enable double partial entries: partial syscall interrupted
> by an irq interrupted by an NMI context. I think it should all work
> out fine but it's a new scenario.
>
> - We'd enable interruptible return from system call, which caused
> trouble in the past. Solvable IMHO, by being careful in the irq
> entry code.
>
> - We'd now have to be extra careful about the possibility of
> exceptions triggered in the entry/exit code itself, triggered by
> any sort of unusual register content or MMU fault.
>
> Simplifications:
>
> - I'd ruthlessly disable IRQs for any sort of non-fast path: for
> example 32-bit compat entries, ptrace, or any other slowpath - at
> least initially, until we map out the long-term effects of this
> optimization.
>
> Does this scare me? Yes, I think it should scare any sane person, but
> I don't think it's all that bad: all the NMI paranoidentry work has
> already blazed the trail, and most of the races will be readily
> triggerable via regular irq loads, so it's not like we'll leave subtle
> bugs in there.
>
> Being able to do the same with certain traps, because irq entry is
> careful about partial entry state, would just be a secondary bonus.
>
> Regarding the CR2 value on page faults:
>
> > Anyway, for page faulting, we traditionally actually wanted an
> > interrupt gate, because of how we wanted to avoid interrupts coming
> > in and possibly messing up %cr2 due to vmalloc faults, but more
> > importantly for preemption. [...]
>
> Here too I think we could take a page from the NMI code: save cr2 in
> the page fault asm code, recognize from the irq code when we are
> interrupting that, and drop into a slowpath that saves cr2 right
> there. Potentially even task context-switching would be safe after
> that.
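>
> The C-level shape of the idea, as a sketch (the proposal above would
> do the save even earlier, in the entry asm; __do_page_fault() here
> stands in for this era's internal fault handler):
>
>   dotraplinkage void sketch_page_fault(struct pt_regs *regs,
>                                        unsigned long error_code)
>   {
>           /* Grab %cr2 before anything can fault or preempt us:
>            * past this point an interrupting context can no longer
>            * lose the faulting address. */
>           unsigned long address = read_cr2();
>
>           __do_page_fault(regs, error_code, address);
>   }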
>
> Saving cr2 in the early page fault code should be much less of an
> overhead than what the IRQ disabling/enabling costs, so this should be
> a win. The potential win could be similar to that of system calls:
>
> - Removal of an implicit 'CLI' in irq gates
>
> - Removal of the explicit 'STI' in conditional_sti in your proposed
> code
>
> - Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in
> error_exit/retint_swapgs.
>
> - Removal of an implicit 'STI' when SYSRET enables interrupts from R11
>
> and the same savings would apply to FPU traps as well. I'd leave all
> other low frequency traps as interrupt gates: #GP, debug, int3, etc.
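>
> (For reference, the hardware difference between the two gate types is
> a single IDT descriptor type value; the layout below mirrors the
> kernel's gate_struct64 from asm/desc_defs.h, reproduced from memory:)
>
>   #define GATE_INTERRUPT 0xE   /* CPU clears RFLAGS.IF on delivery:
>                                   the implicit 'CLI' */
>   #define GATE_TRAP      0xF   /* CPU leaves IF alone: no implicit
>                                   CLI, no conditional_sti() needed */
>
>   struct gate_struct64 {
>           u16 offset_low;
>           u16 segment;
>           unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
>           u16 offset_middle;
>           u32 offset_high;
>           u32 zero1;
>   } __attribute__((packed));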
>
> > [...] vmalloc faults are "harmless" because we'll notice that it's
> > already done, return, and then re-take the real fault. But a
> > preemption event before we read %cr2 can cause bad things to happen:
> >
> > - page fault pushes error code on stack, address in %cr2
> >
> > - we don't have interrupts disabled, and some interrupt comes in and
> > causes preemption
> >
> > - some other process runs, takes another page fault. %cr2 now is the
> > wrong address
> >
> > - we go back to the original thread (perhaps on another cpu), which
> > now reads %cr2 for the wrong address
> >
> > - we send the process a SIGSEGV because we think it's accessing
> > memory that it has no place touching
>
> I think none of this corruption happens if an interrupting context is
> aware of this and 'fixes up' the entry state accordingly. Am I missing
> anything subtle perhaps?
>
> This would be arguably new (and tricky) code, as today the NMI code
> solves this problem by trying to never fault and thus never corrupt
> CR2. But ... unlike NMIs, this triggers so often via a mix of regular
> traps and regular irqs that I'm pretty sure it can be pulled off
> robustly.
>
> > So the page fault code actually *needs* interrupts disabled until we
> > read %cr2. Stupid x86 trap semantics where the error code is on the
> > thread-safe stack, but %cr2 is not.
> >
> > Maybe there is some trick I'm missing, but on the whole I think
> > "interrupt gate + conditional_sti()" does have things going for it.
> > Yes, it still leaves NMI as being special, but NMI really *is*
> > special.
>
> So I think it's all doable, the payoffs are significant in terms of
> entry speedups.
>
> But only on a clean code base, as the x86 entry assembly code was way
> beyond any sane maintainability threshold IMHO - it's fortunately
> improving rapidly these days due to the nice work from Andy and Denys!
>
> What do you think?

From recent memory, rdmsr of the gs base is around 60 cycles. I
haven't benchmarked cli and sti lately, but I found some references
suggesting 2-3 cycles each.
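
If someone wants to re-measure, a throwaway module along these lines
would do it (a sketch, not from this thread; loop overhead is included
in the result, and local_irq_disable/enable are cli/sti only on native,
non-paravirt kernels):

  #include <linux/module.h>
  #include <linux/irqflags.h>
  #include <asm/tsc.h>

  static int __init clisti_bench_init(void)
  {
          unsigned long flags;
          u64 t0, t1;
          int i;

          local_irq_save(flags);
          t0 = get_cycles();
          for (i = 0; i < 1000000; i++) {
                  local_irq_disable();    /* cli */
                  local_irq_enable();     /* sti */
          }
          t1 = get_cycles();
          local_irq_restore(flags);

          pr_info("cli+sti pair: ~%llu cycles\n",
                  (unsigned long long)(t1 - t0) / 1000000);
          return 0;
  }
  module_init(clisti_bench_init);
  MODULE_LICENSE("GPL");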

There's another problem, though: We don't have a real stack pointer
just after syscall and just before sysret, and therefore we *must*
use IST for anything that can happen while we still have a
user-controlled rsp. That includes #DB, NMI, and #MC. With your
proposal, we'd need to add all the IRQs to that list. We could do
that -- maybe we'd just switch back off the IST stack immediately, but
this gets nasty with paravirt, where we don't currently know the full
set of RIP values that have funny stack pointers.
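
(For context, the current IST wiring in traps.c looks roughly like
this -- reproduced from memory, so treat it as a sketch:)

  set_intr_gate_ist(X86_TRAP_DB,  &debug,         DEBUG_STACK);
  set_intr_gate_ist(X86_TRAP_NMI, &nmi,           NMI_STACK);
  set_intr_gate_ist(X86_TRAP_MC,  &machine_check, MCE_STACK);

Each of those vectors gets a known-good stack from the TSS's
Interrupt Stack Table regardless of what RSP held, which is exactly
the property the IRQ vectors would need under Ingo's scheme.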

Hmm. We already use weird IRQ stacks, though. Maybe we could bounce
directly from IST to IRQ stack on IRQ entry. Of course, this gets
rather confused if we have nested IRQs, but that seems to already work
correctly, so maybe it's not a big deal. The really gross part would
be scheduling on IRQ exit.

A different optimization that could be interesting: stop using swapgs
entirely. Instead we could use per-cpu pgds. This hurts context
switches, possibly quite a lot, but it completely removes the swapgs
overhead and shrinks the kernel (no gs prefixes).
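
(What the gs prefix does today, roughly -- this is a simplified
stand-in for percpu_from_op() in asm/percpu.h, not the real macro:)

  /* One gs-relative load per per-cpu read.  With per-cpu pgds the
   * variable could sit at the same virtual address on every cpu and
   * this would become a plain mov with no segment override. */
  #define my_cpu_read_8(var)                                      \
  ({                                                              \
          u64 __val;                                              \
          asm("movq %%gs:%1, %0" : "=r" (__val) : "m" (var));     \
          __val;                                                  \
  })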

We could also use a GPR as a percpu pointer. We'd just have to fish
it out somewhere on syscall entry -- maybe we could use percpu syscall
entries :)

This changes somewhat with rdgsbase and wrgsbase. Andi was working on
that -- what happened to it? It should be much simpler now with the
entry cleanups.

Grumble, AMD really messed up syscalls. Also, swapgs is lousy. They
should have given us separate user/kernel gsbase or separate
positive/negative pgds.

--Andy

>
> Thanks,
>
> Ingo