Re: NMI vs #PF clash

From: Steven Rostedt
Date: Tue May 22 2012 - 20:39:39 EST


On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote:
> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > Is reading it fast? Then we could do a two reads and only write when
> > needed.
>
> Even better: we could do nothing at all.
>
> We could just say: let's make sure that any #PF case that can happen
> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
> regs' contents.
>
> At that point, what could happen is
> - #PF
> - NMI
> - #PF
> - read cr2 for NMI fault
> - handle the NMI #PF
> - return from #PF
> - return from #NMI
> - read cr2 for original #PF fault - but get the NMI cr2 again
> - hande the #PF again (this should be a no-op now)

I tested this by forcing all NMIs to have a page fault, and then running
perf against hackbench to see what breaks. This doesn't work because the
cr2 contains where the fault happened.

If the NMI faulting address was in kernel space, the page fault it
interrupted will think the user task is trying to access kernel memory
and fault.

Basically, the address will not be valid for the user task and this will
make the task take a SEGFAULT.

I had the NMI fault on the address 0x12348, but had a exception table to
handle it.

hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000]

With the below patch, the faulting NMIs don't seem to be harming anything.


> - return from #PF
> - instruction restart causes new #PF
> - now we do the original page fault
>
> So one option is to just make sure that the few cases (just the
> vmalloc area?) that NMI can trigger are all ok to be re-done with
> other state.
>
> I note that right now we have
>
> if (unlikely(fault_in_kernel_space(address))) {
> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
> if (vmalloc_fault(address) >= 0)
> return;
>
> and that the error_code check means that the retried NMI #PF would not
> go through that. But maybe we don't even need that check?
>
> That error_code thing seems to literally be the only thing that keeps
> us from just re-doing the vmalloc_fault() silently.

Avi, you want to test this?

-- Steve

Index: linux-trace.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-trace.git.orig/arch/x86/kernel/entry_64.S
+++ linux-trace.git/arch/x86/kernel/entry_64.S
@@ -1586,7 +1586,7 @@ ENTRY(nmi)
* Check the special variable on the stack to see if NMIs are
* executing.
*/
- cmpl $1, -8(%rsp)
+ cmpl $1, -16(%rsp)
je nested_nmi

/*
@@ -1615,7 +1615,7 @@ nested_nmi:

1:
/* Set up the interrupted NMIs stack to jump to repeat_nmi */
- leaq -6*8(%rsp), %rdx
+ leaq -7*8(%rsp), %rdx
movq %rdx, %rsp
CFI_ADJUST_CFA_OFFSET 6*8
pushq_cfi $__KERNEL_DS
@@ -1625,7 +1625,7 @@ nested_nmi:
pushq_cfi $repeat_nmi

/* Put stack back */
- addq $(11*8), %rsp
+ addq $(12*8), %rsp
CFI_ADJUST_CFA_OFFSET -11*8

nested_nmi_out:
@@ -1650,6 +1650,8 @@ first_nmi:
* +-------------------------+
* | temp storage for rdx |
* +-------------------------+
+ * | saved cr2 |
+ * +-------------------------+
* | NMI executing variable |
* +-------------------------+
* | Saved SS |
@@ -1672,8 +1674,20 @@ first_nmi:
* to the repeat_nmi. The original stack frame and the temp storage
* is also used by nested NMIs and can not be trusted on exit.
*/
+
+ /*
+ * Save off the CR2 register. If we take a page fault in the NMI then
+ * it could corrupt the CR2 value. If the NMI preempts a page fault
+ * handler before it was able to read the CR2 register, and then the
+ * NMI itself takes a page fault, the page fault that was preempted
+ * will read the information from the NMI page fault and not the
+ * origin fault. Save it off and restore it if it changes.
+ */
+ movq %cr2, %rdx
+ pushq_cfi %rdx
+
/* Do not pop rdx, nested NMIs will corrupt that part of the stack */
- movq (%rsp), %rdx
+ movq 8(%rsp), %rdx
CFI_RESTORE rdx

/* Leave room for the "in NMI" variable. */
@@ -1682,7 +1696,7 @@ first_nmi:

/* Copy the stack frame to the Saved frame */
.rept 5
- pushq_cfi 6*8(%rsp)
+ pushq_cfi 7*8(%rsp)
.endr
CFI_DEF_CFA_OFFSET SS+8-RIP

@@ -1734,6 +1748,13 @@ end_repeat_nmi:
nmi_swapgs:
SWAPGS_UNSAFE_STACK
nmi_restore:
+ /* Test if the cr2 reg changed */
+ movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
+ movq %cr2, %rcx
+ cmp %rdx, %rcx
+ je 1f
+ movq %rdx, %cr2
+1:
RESTORE_ALL 8
/* Clear the NMI executing stack variable */
movq $0, 10*8(%rsp)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/