Re: [RFC PATCH] x86/kvm: Allow kvm_read_and_reset_apf_flags() to be instrumented

From: Sean Christopherson
Date: Thu Dec 30 2021 - 13:40:54 EST


On Fri, Nov 26, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> Both VMX and SVM made mistakes of calling kvm_read_and_reset_apf_flags()
> in instrumentable code:
> vmx_vcpu_enter_exit()
> ASYNC PF induced VMEXIT
> save cr2
> leave noinstr section
> -- kernel #PF can happen here due to instrumentation
> handle_exception_nmi_irqoff
> kvm_read_and_reset_apf_flags()
>
> If kernel #PF happens before handle_exception_nmi_irqoff() after leaving
> noinstr section due to instrumentation, kernel #PF handler will consume
> the incorrect apf flags and panic.
>
> The problem might be fixed by moving kvm_read_and_reset_apf_flags()
> into vmx_vcpu_enter_exit() to consume apf flags earlier before leaving
> noinstr section like the way it handles CR2.
>
> But linux kernel only resigters ASYNC PF for userspace and guest, so

I'd omit the guest part. While technically true, it's not relevant to the change
as the instrumentation safety comes purely from the user_mode() check. Mentioning
the "guest" side of things gets confusing as the "host" may be an L1 kernel, in
which case it is also a guest and may have its own guests.

It'd also be helpful for future readers to call out that this is true only since
commit 3a7c8fafd1b4 ("x86/kvm: Restrict ASYNC_PF to user space"). Allowing async
#PF in kernel mode was presumably why this code was non-instrumentable in the
first place.

> ASYNC PF can't hit when it is in kernel, so apf flags can be changed to
> be consumed only when the #PF is from guest or userspace.

...

> @@ -1538,7 +1514,20 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
> state = irqentry_enter(regs);
>
> instrumentation_begin();
> - handle_page_fault(regs, error_code, address);
> +
> + /*
> + * KVM uses #PF vector to deliver 'page not present' events to guests
> + * (asynchronous page fault mechanism). The event happens when a
> + * userspace task is trying to access some valid (from guest's point of
> + * view) memory which is not currently mapped by the host (e.g. the
> + * memory is swapped out). Note, the corresponding "page ready" event
> + * which is injected when the memory becomes available, is delivered via
> + * an interrupt mechanism and not a #PF exception
> + * (see arch/x86/kernel/kvm.c: sysvec_kvm_asyncpf_interrupt()).
> + */
> + if (!user_mode(regs) || !kvm_handle_async_user_pf(regs, (u32)address))
> + handle_page_fault(regs, error_code, address);

To preserve the panic() if an async #PF is injected for !user_mode, any reason
not to do?

if (!kvm_handle_async_user_pf(regs, (u32)address))
handle_page_fault(regs, error_code, address);

Then __kvm_handle_async_user_pf() can keep its panic() on !user_mode(). And it
also saves a conditional when kvm_async_pf_enabled is not true.