Re: [PATCH v5 05/15] KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value

From: Sean Christopherson
Date: Thu Nov 03 2022 - 12:53:24 EST


On Wed, Nov 02, 2022, Yu Zhang wrote:
> On Tue, Nov 01, 2022 at 05:58:21PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 01, 2022, Yu Zhang wrote:
> > > What I observed is that vmx->nested.msrs.secondary_ctls_high will be changed
> > > in vmx_adjust_secondary_exec_control(), which can be triggered after cpuid is
> > > set.
> >
> > Ugh, that path got overlooked when we yanked out KVM's manipulaton of VMX MSRs
> > in response to guest CPUID changes. I wonder if we can get away with changing
> > KVM's behavior to only ensure a feature isn't exposed to L2 when it's not exposed
> > to L1.
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 6b4266e949a3..cfc35d559d91 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -4523,8 +4523,8 @@ vmx_adjust_secondary_exec_control(struct vcpu_vmx *vmx, u32 *exec_control,
> > * Update the nested MSR settings so that a nested VMM can/can't set
> > * controls for features that are/aren't exposed to the guest.
> > */
> > - if (nested) {
> > - if (enabled)
> > + if (nested && !enabled)
> > + if (exiting)
> > vmx->nested.msrs.secondary_ctls_high |= control;
> > else
> > vmx->nested.msrs.secondary_ctls_high &= ~control;
> >
>
> Indeed, this change can make sure a feature won't be exposed to L2, if L1
> does not have it.

No, that's not the goal of the change. KVM already hides features in the VMX MSRs
if the base feature is not supported in L1 according to guest CPUID. The problem
is that, currently, KVM also _forces_ features to be enabled in the VMX MSRs when
the base feature IS supported in L1 (CPUID).

Ideally, KVM should NEVER manipulate VMX MSRs in response to guest CPUID changes.
That's what I was referring to earlier by commits:

8805875aa473 ("Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"")
9389d5774aca ("Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL"")

E.g. if userspace sets VMX MSRs and then sets guest CPUID, KVM will override the
nVMX CPU model defined by userspace. The scenario where userspace hides a "base"
feature but exposes the feature in the VMX MSRs is nonsensical, which is why I
think KVM can likely get away with force-disabling features.

The reverse is completely legitimate though: hiding a feature in VMX MSRs even if
the base feature is supported for L1, i.e. disallowing L1 from enabling the feature
in L2, is something that real VMMs will actually do, e.g. if the user doesn't trust
that KVM correctly handles all aspects of nested virtualization for the feature.

In other words, the behavior you're observing, where vmx->nested.msrs.secondary_ctls_high
is changed by vmx_adjust_secondary_exec_control(), is a completely separate bug
than the one below.

> But for the feature bits that L1 has, yet cleared from the
> vmcs_conf->nested.msrs.secondary_ctls_high in nested_vmx_setup_ctls_msrs(),
> there's no chance for userspace vmm to reset it again.
>
> Well, I am not suggesting to give userspace vmm such permission(which I believe
> is incorrect). And IIUC, vmcs_conf->nested.msrs.secondary_ctls_high shall serve
> as a template to initialize vmx->nested.msrs.secondary_ctls_high. So maybe we
> shall not mask off some features in nested_vmx_setup_ctls_msrs() at the beginning.
>
> > > Since KVM's config(vmcs_config->nested.secondary_ctls_high) is done during init
> > > by nested_vmx_setup_ctls_msrs(), which only kept a subset of the flags from the
> > > vmcs_confg->cpu_based_2nd_exec_ctrl, the vmx_restore_control_msr() could fail
> > > later, when userspace VMM tries to enable a feature(the only one I witnessed is
> > > SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE) by setting MSR_IA32_VMX_PROCBASED_CTLS2.
> > > Because the vmx->nested.msrs.secondary_ctls_high is updated by cpuid, but this
> > > bit is not taken from vmcs_conf->cpu_based_2nd_exec_ctrl by nested_vmx_setup_ctls_msrs()
> > > for vmcs_config->nested.secondary_ctls_high.
> > >
> > > The failure can be fixed, simply by adding SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE in
> > > nested_vmx_setup_ctls_msrs(), e.g.
> >
> > Assuming KVM actually supports user wait/pause in L2, this is an orthogonal bug
> > to the CPUID-based manipulation above. KVM simply neglects to advertise to userspace
> > that ENABLE_USR_WAIT_PAUSE is supported for nested virtualization.
> >
> > If KVM doesn't correctly support virtualizing user wait/pause for L2, then the
> > correct location to fix this is in vmx_secondary_exec_control().
> >
>
> Sorry, why vmx_secondary_exec_control()?

You missed the qualifier:

If KVM doesn't correctly support virtualizing user wait/pause for L2

If KVM should NOT be exposing ENABLE_USR_WAIT_PAUSE to the L1 VMM, then NOT
propagating the feature to msrs->secondary_ctls_low is correct. And if that's
the case, then vmx_secondary_exec_control() needs to be modified so that it does
NOT set ENABLE_USR_WAIT_PAUSE in vmx->nested.msrs.secondary_ctls_high.

> Could we just change nested_vmx_setup_ctls_msrs() like below:

If KVM correctly virtualizes the feature in a nested scenario, yes. I haven't
looked into ENABLE_USR_WAIT_PAUSE enough to know whether or not KVM gets the
nested virtualization pieces correct, hence the above qualifier.

> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 0c62352dda6a..fa255391718c 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -6791,13 +6791,7 @@ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
> msrs->procbased_ctls_low &=
> ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING);
>
> - /*
> - * secondary cpu-based controls. Do not include those that
> - * depend on CPUID bits, they are added later by
> - * vmx_vcpu_after_set_cpuid.
> - */
> msrs->secondary_ctls_low = 0;
> -
> msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl;
> msrs->secondary_ctls_high &=
> SECONDARY_EXEC_DESC |
> @@ -6810,7 +6804,8 @@ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
> SECONDARY_EXEC_ENABLE_INVPCID |
> SECONDARY_EXEC_RDSEED_EXITING |
> SECONDARY_EXEC_XSAVES |
> - SECONDARY_EXEC_TSC_SCALING;
> + SECONDARY_EXEC_TSC_SCALING |
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
>
> /*
> * We can emulate "VMCS shadowing," even if the hardware