Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

From: Ingo Molnar
Date: Sun Mar 08 2015 - 04:55:36 EST

Next message: Lu, Baolu: "Re: [PATCH 1/1] usb: xhci: handle Config Error Change (CEC) in xhci driver"
Previous message: Jean Delvare: "[PATCH] misc: bh1780: Add module aliases"
Next in thread: Ingo Molnar: "Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> > We could save the same 10 cycles from page fault overhead as well,
> > AFAICS.
>
> Are trap gates actually noticeably faster? Or is it just he
> "conditional_sti()" you're worried about?

( I'll talk about the CR2 complication later, please ignore that
problem for a moment. )

So I base my thinking on the following hierarchy of fast paths. In a
typical x86 system there are 4 main types of 'context switches', in
order of decreasing frequency:

- syscall 'context switch': entry/exit from syscall
- trap/fault 'context switch': entry/exit from trap/fault
- irq 'context switch': entry/exit from irqs
- task 'context switch': scheduler context-switch

Where each successive level is about an order of magnitude less
frequently executed on a typical system than the previous one, so to
optimize a level we are willing to push overhead to the next one(s).

So the primary payoff in executing much of the entry code with irqs
enabled would be to allow 64-bit *syscalls* to be made without irqs
disabled: the first, most important level of context entries.

Doing that would give us four (theoretical) performance advantages:

- No implicit irq disabling overhead when the syscall instruction is
executed: we could change MSR_SYSCALL_MASK from 0xc0000084 to
0xc0000284, which removes the implicit CLI on syscall entry.

- No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
system_call.

- No explicit irq disabling overhead in the ret_from_sys_call fast
path, i.e. no DISABLE_INTERRUPTS() [CLI].

- No implicit irq enabling overhead in ret_from_sys_call's
USERGS_SYSRET64: the SYSRETQ instruction would not have to
re-enable irqs as the user-space IF in R11 would match that of the
current IF.

whether that's an actual performance win in practice as well needs to
be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles
range: which is absolutely huge in terms of system_call optimizations.

Now do I think this could be done realistically? I think yes (by
re-using the NMI code's paranoid entry codepaths for the irq code as
well, essentially fixing up the effects of any partial entries in an
irq entry slow path), but I could be wrong about that.

My main worry isn't even feasibility but maintainability and general
robustness: all these asm cleanups we are doing now could enable such
a trick to be pulled off robustly.

But I think it could be done technically, because the NMI code already
has to be careful about 'preempting' partial entries, so we have the
logic.

Complications:

- We now enable double partial entries: partial syscall interrupted
by an irq interrupted by an NMI context. I think it should all work
out fine but it's a new scenario.

- We'd enable interruptible return from system call, which caused
trouble in the past. Solvable IMHO, by being careful in the irq
entry code.

- We'd now have to be extra careful about the possibility of
exceptions triggered in the entry/exit code itself, triggered by
any sort of unusual register content or MMU fault.

Simplifications:

- I'd ruthlessly disable IRQs for any sort of non fast path: for
example 32-bit compat entries, ptrace or any other slowpath - at
least initially until we map out the long term effects of this
optimization.

Does this scare me? Yes, I think it should scare any sane person, but
I don't think it's all that bad: all the NMI paranoidentry work has
already the trail blazed, and most of the races will be readily
triggerable via regular irq loads, so it's not like we'll leave subtle
bugs in there.

Being able to do the same with certain traps, because irq entry is
careful about partial entry state, would just be a secondary bonus.

Regarding the CR2 value on page faults:

> Anyway, for page faulting, we traditionally actually wanted an
> interrupt gate, because of how we wanted to avoid interrupts coming
> in and possibly messing up %cr2 due to vmalloc faults, but more
> importantly for preemption. [...]

Here too I think we could take a page from the NMI code: save cr2 in
the page fault asm code, recognize from the irq code when we are
interrupting that and dropping into a slowpath that saves cr2 right
there. Potentially task-context-switching will be safe after that.

Saving cr2 in the early page fault code should be much less of an
overhead than what the IRQ disabling/enabling costs, so this should be
a win. The potential win could be similar to that of system calls:

- Removal of an implicit 'CLI' in irq gates

- Removal of the explicit 'STI' in conditional_sti in your proposed
code

- Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in
error_exit/retint_swapgs.

- Removal of an implicit 'STI' when SYSRET enables interrupts from R11

and the same savings would apply to FPU traps as well. I'd leave all
other low frequency traps as interrupt gates: #GP, debug, int3, etc.

> [...] vmalloc faults are "harmless" because we'll notice that it's
> already done, return, and then re-take the real fault. But a
> preemption event before we read %cr2 can cause bad things to happen:
>
> - page fault pushes error code on stack, address in %cr2
>
> - we don't have interrupts disabled, and some interrupt comes in and
> causes preemption
>
> - some other process runs, take another page fault. %cr2 now is the
> wrong address
>
> - we go back to the original thread (perhaps on another cpu), which
> now reads %cr2 for the wrong address
>
> - we send the process a SIGSEGV because we think it's accessing
> memory that it has no place touching

I think none of this corruption happens if an interrupting context is
aware of this and 'fixes up' the entry state accordingly. Am I missing
anything subtle perhaps?

This would be arguably new (and tricky) code, as today the NMI code
solves this problem by trying to never fault and thus never corrupt
CR2. But ... unlike NMIs, this triggers so often via a mix of regular
traps and regular irqs that I'm pretty sure it can be pulled off
robustly.

> So the page fault code actually *needs* interrupts disabled until we
> read %cr2. Stupid x86 trap semantics where the error code is on the
> thread-safe stack, but %cr2 is not.
>
> Maybe there is some trick I'm missing, but on the whole I think
> "interrupt gate + conditional_sti()" does have things going for it.
> Yes, it still leaves NMI as being special, but NMI really *is*
> special.

So I think it's all doable, the payoffs are significant in terms of
entry speedups.

But only on a clean code base, as the x86 entry assembly code was way
beyond any sane maintainability threshold IMHO - it's fortunately
improving rapidly these days due to the nice work from Andy and Denys!

What do you think?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Lu, Baolu: "Re: [PATCH 1/1] usb: xhci: handle Config Error Change (CEC) in xhci driver"
Previous message: Jean Delvare: "[PATCH] misc: bh1780: Add module aliases"
Next in thread: Ingo Molnar: "Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]