Re: [PATCH] x86/entry/64: De-Xen-ify our NMI code further

From: Andy Lutomirski
Date: Sun Jan 24 2021 - 22:03:19 EST


On Sun, Jan 24, 2021 at 5:13 PM Lai Jiangshan <jiangshanlai@xxxxxxxxx> wrote:
>
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> The commit 929bacec21478("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code and left a comment
> about "inspecting RIP instead". But until now, "inspecting RIP instead"
> has not been made happened and this patch tries to complete it.
>
> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> ---
> arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
> 1 file changed, 13 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..cb6b8a6c6652 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
> je nested_nmi
>
> /*
> - * Now test if the previous stack was an NMI stack. This covers
> - * the case where we interrupt an outer NMI after it clears
> - * "NMI executing" but before IRET. We need to be careful, though:
> - * there is one case in which RSP could point to the NMI stack
> - * despite there being no NMI active: naughty userspace controls
> - * RSP at the very beginning of the SYSCALL targets. We can
> - * pull a fast one on naughty userspace, though: we program
> - * SYSCALL to mask DF, so userspace cannot cause DF to be set
> - * if it controls the kernel's RSP. We set DF before we clear
> - * "NMI executing".
> + * Now test if we interrupt an outer NMI after it clears
> + * "NMI executing" but before iret.

s/interrupt/interrupted

But let's make it a lot more clear:


Now test if we interrupted an outer NMI that just cleared "NMI
executing" and is about to IRET. This is a single-instruction window.
This check does not handle the case in which we get a nested interrupt
(#MC, #VE, #VC, etc.) after clearing "NMI executing" but before the
outer NMI executes IRET.

> + movq $nmi_executing_cleared, %rdx
> + cmpq 8(%rsp), %rdx
> + jne first_nmi

If we're okay with non-PIC code, then this is suboptimal -- you can
just compare directly. But using PIC is polite, so that movq should
be a RIP-relative leaq.

>
> /* This is a nested NMI. */
>
> @@ -1438,16 +1418,16 @@ nmi_restore:
> addq $6*8, %rsp
>
> /*
> - * Clear "NMI executing". Set DF first so that we can easily
> - * distinguish the remaining code between here and IRET from
> - * the SYSCALL entry and exit paths.
> - *
> - * We arguably should just inspect RIP instead, but I (Andy) wrote
> - * this code when I had the misapprehension that Xen PV supported
> - * NMIs, and Xen PV would break that approach.
> + * Clear "NMI executing". It also leaves a window after it before
> + * iret which should be also considered to be "NMI executing" albeit
> + * with "NMI executing" variable being zero. So we should also check
> + * the RIP after it when checking "NMI executing". See the code
> + * before nested_nmi. No code is allowed to be added to between
> + * clearing "NMI executing" and iret unless we check a larger window
> + * with a range of RIPs instead of currently a single-RIP window.

Let's simplify this comment:

Clear "NMI executing". This leaves a window in which a nested NMI
could observe "NMI executing" cleared, and a nested NMI will detect
this by inspecting RIP.

> */
> - std
> movq $0, 5*8(%rsp) /* clear "NMI executing" */
> +nmi_executing_cleared:
>

This should be local. Let's call it .Lnmi_iret. And add a comment:

.Lnmi_iret: /* must be immediately after clearing "NMI executing" */

> /*
> * iretq reads the "iret" frame and exits the NMI stack in a