Re: NMI vs #PF clash

From: Brian Gerst
Date: Tue May 22 2012 - 21:26:14 EST


On Tue, May 22, 2012 at 8:39 PM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote:
>> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> >
>> > Is reading it fast? Then we could do a two reads and only write when
>> > needed.
>>
>> Even better: we could do nothing at all.
>>
>> We could just say: let's make sure that any #PF case that can happen
>> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
>> regs' contents.
>>
>> At that point, what could happen is
>> Â- #PF
>> Â - NMI
>> Â Â- #PF
>> Â Â - read cr2 for NMI fault
>> Â Â - handle the NMI #PF
>> Â Â - return from #PF
>> Â - return from #NMI
>> Â - read cr2 for original #PF fault - but get the NMI cr2 again
>> Â - hande the #PF again (this should be a no-op now)
>
> I tested this by forcing all NMIs to have a page fault, and then running
> perf against hackbench to see what breaks. This doesn't work because the
> cr2 contains where the fault happened.
>
> If the NMI faulting address was in kernel space, the page fault it
> interrupted will think the user task is trying to access kernel memory
> and fault.
>
> Basically, the address will not be valid for the user task and this will
> make the task take a SEGFAULT.
>
> I had the NMI fault on the address 0x12348, but had a exception table to
> handle it.
>
> hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000]
>
> With the below patch, the faulting NMIs don't seem to be harming anything.
>
>
>> Â - return from #PF
>> Â- instruction restart causes new #PF
>> Â - now we do the original page fault
>>
>> So one option is to just make sure that the few cases (just the
>> vmalloc area?) that NMI can trigger are all ok to be re-done with
>> other state.
>>
>> I note that right now we have
>>
>> Â Â Â Â if (unlikely(fault_in_kernel_space(address))) {
>> Â Â Â Â Â Â Â Â if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
>> Â Â Â Â Â Â Â Â Â Â Â Â if (vmalloc_fault(address) >= 0)
>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â return;
>>
>> and that the error_code check means that the retried NMI #PF would not
>> go through that. But maybe we don't even need that check?
>>
>> That error_code thing seems to literally be the only thing that keeps
>> us from just re-doing the vmalloc_fault() silently.
>
> Avi, you want to test this?
>
> -- Steve
>
> Index: linux-trace.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/entry_64.S
> +++ linux-trace.git/arch/x86/kernel/entry_64.S
> @@ -1586,7 +1586,7 @@ ENTRY(nmi)
> Â Â Â Â * Check the special variable on the stack to see if NMIs are
> Â Â Â Â * executing.
> Â Â Â Â */
> - Â Â Â cmpl $1, -8(%rsp)
> + Â Â Â cmpl $1, -16(%rsp)
> Â Â Â Âje nested_nmi
>
> Â Â Â Â/*
> @@ -1615,7 +1615,7 @@ nested_nmi:
>
> Â1:
> Â Â Â Â/* Set up the interrupted NMIs stack to jump to repeat_nmi */
> - Â Â Â leaq -6*8(%rsp), %rdx
> + Â Â Â leaq -7*8(%rsp), %rdx
> Â Â Â Âmovq %rdx, %rsp
> Â Â Â ÂCFI_ADJUST_CFA_OFFSET 6*8
> Â Â Â Âpushq_cfi $__KERNEL_DS
> @@ -1625,7 +1625,7 @@ nested_nmi:
> Â Â Â Âpushq_cfi $repeat_nmi
>
> Â Â Â Â/* Put stack back */
> - Â Â Â addq $(11*8), %rsp
> + Â Â Â addq $(12*8), %rsp
> Â Â Â ÂCFI_ADJUST_CFA_OFFSET -11*8
>
> Ânested_nmi_out:
> @@ -1650,6 +1650,8 @@ first_nmi:
> Â Â Â Â * +-------------------------+
>     * | temp storage for rdx  Â|
> Â Â Â Â * +-------------------------+
> + Â Â Â Â* | saved cr2 Â Â Â Â Â Â Â |
> + Â Â Â Â* +-------------------------+
> Â Â Â Â * | NMI executing variable Â|
> Â Â Â Â * +-------------------------+
> Â Â Â Â * | Saved SS Â Â Â Â Â Â Â Â|
> @@ -1672,8 +1674,20 @@ first_nmi:
> Â Â Â Â * to the repeat_nmi. The original stack frame and the temp storage
> Â Â Â Â * is also used by nested NMIs and can not be trusted on exit.
> Â Â Â Â */
> +
> + Â Â Â /*
> + Â Â Â Â* Save off the CR2 register. If we take a page fault in the NMI then
> + Â Â Â Â* it could corrupt the CR2 value. If the NMI preempts a page fault
> + Â Â Â Â* handler before it was able to read the CR2 register, and then the
> + Â Â Â Â* NMI itself takes a page fault, the page fault that was preempted
> + Â Â Â Â* will read the information from the NMI page fault and not the
> + Â Â Â Â* origin fault. Save it off and restore it if it changes.
> + Â Â Â Â*/
> + Â Â Â movq %cr2, %rdx
> + Â Â Â pushq_cfi %rdx
> +
> Â Â Â Â/* Do not pop rdx, nested NMIs will corrupt that part of the stack */
> - Â Â Â movq (%rsp), %rdx
> + Â Â Â movq 8(%rsp), %rdx
> Â Â Â ÂCFI_RESTORE rdx
>
> Â Â Â Â/* Leave room for the "in NMI" variable. */
> @@ -1682,7 +1696,7 @@ first_nmi:
>
> Â Â Â Â/* Copy the stack frame to the Saved frame */
> Â Â Â Â.rept 5
> - Â Â Â pushq_cfi 6*8(%rsp)
> + Â Â Â pushq_cfi 7*8(%rsp)
> Â Â Â Â.endr
> Â Â Â ÂCFI_DEF_CFA_OFFSET SS+8-RIP
>
> @@ -1734,6 +1748,13 @@ end_repeat_nmi:
> Ânmi_swapgs:
> Â Â Â ÂSWAPGS_UNSAFE_STACK
> Ânmi_restore:
> + Â Â Â /* Test if the cr2 reg changed */
> + Â Â Â movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
> + Â Â Â movq %cr2, %rcx
> + Â Â Â cmp %rdx, %rcx
> + Â Â Â je 1f
> + Â Â Â movq %rdx, %cr2
> +1:
> Â Â Â ÂRESTORE_ALL 8
> Â Â Â Â/* Clear the NMI executing stack variable */
> Â Â Â Âmovq $0, 10*8(%rsp)

You could save cr2 in a callee-saved register (like r12) instead of
putting it on the stack.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/