RE: [PATCH v3 22/22] kvm: x86: Disable interception for IA32_XFD on demand

From: Tian, Kevin
Date: Wed Dec 29 2021 - 20:28:16 EST


> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Sent: Thursday, December 30, 2021 1:26 AM
>
> On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, December 29, 2021 11:35 AM
> > > >
> > > > Speaking of nested, interception of #NM in
> > > > vmx_update_exception_bitmap() is wrong with respect to nested guests.
> > > > Until XFD is supported for L2, which I didn't see in this series, #NM
> > > > should not be intercepted while L2 is running.
> > >
> > > Can you remind what additional thing is required to support XFD for L2?
> >
> > ok, at least we require additional work on when to disable write
> > interception. It can be done only when both L0 and L1 make the same
> > decision, just like what we did for many other VMCS settings...
>
> I'm not terribly concerned with exposing XFD to L2 right now, I'm much more
> concerned with not breaking nVMX when XFD is exposed to L1.
>
> > > If it's only about performance, I prefer the current conservative
> > > approach as the first step. As explained earlier, #NM should be rare if
> > > the guest doesn't run AMX applications at all. Adding nested into this
> > > picture doesn't make things much worse.
> >
> > All your comments incorporated except this one. As said, always trapping
> > #NM for L2 is not a big problem, as long as it's rare and doesn't break
> > functionality. Usually a relatively static scheme is safer than a dynamic
> > one in case anything is overlooked in the initial implementation. 😊
>
> I strongly disagree, it's not automagically safer. Either way, we need to
> validate that KVM does the correct thing with respect to vmcs12. E.g. does
> KVM correctly reflect the #NM back to L2 when it's not intercepted by L1?
> Does it synthesize a nested VM-Exit when it is intercepted by L1? The
> testing required doesn't magically go away.
>
> As posted, there is zero chance that the patches correctly handle #NM in L2,
> because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being
> forwarded to L1. E.g. if L0 (KVM) and L1 both intercept #NM,
> handle_exception_nm() will queue a #NM for injection and then synthesize a
> nested VM-Exit, i.e. the #NM from L2 will get injected into L1 after the
> nested exit.

Yes, it's incorrect behavior.

>
> That also means handle_exception_nmi_irqoff() => handle_exception_nm() is
> fatally broken on non-XFD hardware, as it will attempt
> RDMSR(MSR_IA32_XFD_ERR) if L1 intercepts #NM, since
> handle_exception_nmi_irqoff() will run before __vmx_handle_exit() =>
> nested_vmx_reflect_vmexit() checks whether L0 or L1 should handle the exit.

Ditto. Thanks for pointing out those facts that we obviously overlooked.

So handle_exception_nm() should neither blindly read xfd_err (the #NM might
be caused by L1's interception on a non-XFD platform) nor blindly queue a #NM
exception (triggered in L2) to L1 when L1 intercepts #NM and therefore
expects a VM-exit instead.

The former suggests that reading xfd_err should be made conditional, similar
to what we did in vmx_update_exception_bitmap(). The latter suggests that
queueing the actual exception is better postponed until __vmx_handle_exit(),
i.e. after nested_vmx_reflect_vmexit() has decided whether L0 or L1 owns the
exit.
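Something along these lines, perhaps (a very rough sketch of the direction,
not the actual patch; the guest_fpu.fpstate->xfd / guest_fpu.xfd_err fields
and the is_nm_fault() helper are placeholders for whatever the series ends up
using):

/* irqoff path: only latch the MSR, and only when XFD is actually in use */
static void handle_exception_nm(struct kvm_vcpu *vcpu)
{
	/*
	 * Don't blindly read MSR_IA32_XFD_ERR: the #NM may have been
	 * triggered by L1's interception on a platform w/o XFD, where the
	 * MSR doesn't even exist.  Read it only when guest XFD is non-zero,
	 * i.e. when the #NM can actually be XFD-induced.
	 */
	if (vcpu->arch.guest_fpu.fpstate->xfd)
		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}

and then queue the #NM from the normal exit path, which only runs when L0
owns the exit, e.g.:

	/*
	 * In __vmx_handle_exit()/handle_exception_nmi(), i.e. after
	 * nested_vmx_reflect_vmexit() has already let L0 keep the exit:
	 */
	u32 intr_info = vmx_get_intr_info(vcpu);

	if (is_nm_fault(intr_info)) {
		kvm_queue_exception(vcpu, NM_VECTOR);
		return 1;
	}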

We are looking at this change now.

And once the #NM handler correctly handles interception by either L0 or L1,
I'm not sure we still want to special-case L1 vs. L2 in
vmx_update_exception_bitmap(), since in the end L0 wants to intercept #NM to
save xfd_err for both L1 and L2 (if L1 exposes the xfd capability to L2).
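
i.e. keep something like the below in vmx_update_exception_bitmap(), keyed
purely on whether XFD is in use rather than on is_guest_mode() (again just a
sketch, reusing the placeholder field from above; 'eb' is the local exception
bitmap being built in that function):

	/*
	 * Trap #NM whenever guest XFD is non-zero so that XFD_ERR can be
	 * saved before host activity clobbers it.  This applies to both L1
	 * and L2; whether the exit is then reflected to L1 is decided later
	 * by nested_vmx_reflect_vmexit().
	 */
	if (vcpu->arch.guest_fpu.fpstate->xfd)
		eb |= (1u << NM_VECTOR);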

Thanks
Kevin