Re: [PATCH v2 2/3] KVM: X86: Provide a capability to disable cstate msr read intercepts

From: Paolo Bonzini
Date: Tue Jun 04 2019 - 12:57:35 EST


On 21/05/19 08:06, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
>
> Allow guest reads CORE cstate when exposing host CPU power management capabilities
> to the guest. PKG cstate is restricted to avoid a guest to get the whole package
> information in multi-tenant scenario.
>
> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: Radim KrÄmÃÅ <rkrcmar@xxxxxxxxxx>
> Cc: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Cc: Liran Alon <liran.alon@xxxxxxxxxx>
> Signed-off-by: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> ---
> v1 -> v2:
> * use a separate bit for KVM_CAP_X86_DISABLE_EXITS
>
> Documentation/virtual/kvm/api.txt | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/vmx/vmx.c | 6 ++++++
> arch/x86/kvm/x86.c | 5 ++++-
> arch/x86/kvm/x86.h | 5 +++++
> include/uapi/linux/kvm.h | 4 +++-
> tools/include/uapi/linux/kvm.h | 4 +++-
> 7 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 33cd92d..91fd86f 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4894,6 +4894,7 @@ Valid bits in args[0] are
> #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
> #define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
> #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2)
> +#define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3)
>
> Enabling this capability on a VM provides userspace with a way to no
> longer intercept some instructions for improved latency in some
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d5457c7..1ce8289 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -882,6 +882,7 @@ struct kvm_arch {
> bool mwait_in_guest;
> bool hlt_in_guest;
> bool pause_in_guest;
> + bool cstate_in_guest;
>
> unsigned long irq_sources_bitmap;
> s64 kvmclock_offset;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0861c71..da24f18 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6637,6 +6637,12 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
> + if (kvm_cstate_in_guest(kvm)) {
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C1_RES, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);

I think I have changed my mind on the implementation of this, sorry.

1) We should emulate these MSRs always, otherwise the guest API changes
between different values of KVM_CAP_X86_DISABLE_EXITS which is not
intended. Also, KVM_CAP_X86_DISABLE_EXITS does not prevent live
migration, so it should be possible to set the MSRs in the host to
change the delta between the host and guest values.

2) If both KVM_X86_DISABLE_EXITS_HLT and KVM_X86_DISABLE_EXITS_MWAIT are
disabled (i.e. exit happens), the MSRs will be purely emulated.
C3/C6/C7 residency will never increase (it will remain the value that is
set by the host). When the VM executes an hlt vmexit, it should save
the current TSC. When it comes back, the C1 residency MSR should be
increased by the time that has passed.

3) If KVM_X86_DISABLE_EXITS_HLT is enabled but
KVM_X86_DISABLE_EXITS_MWAIT is disabled (i.e. mait exits happen),
C3/C6/C7 residency will also never increase, but the C1 residency value
should be read using rdmsr from the host, with a delta added from the
host value.

4) If KVM_X86_DISABLE_EXITS_HLT and KVM_X86_DISABLE_EXITS_MWAIT are both
disabled (i.e. mwait exits do not happen), all four residency values
should be read using rdmsr from the host, with a delta added from the
host value.

5) If KVM_X86_DISABLE_EXITS_HLT is disabled and
KVM_X86_DISABLE_EXITS_MWAIT is enabled, the configuration makes no sense
so it's okay not to be very optimized. In this case, the residency
value should be read as in (4), but hlt vmexits will be accounted as in
(2) so we need to be careful not to double-count the residency during
hlt. This means doing four rdmsr before the beginning of the hlt vmexit
and four at the end of the hlt vmexit.

Therefore the data structure should be something like

struct kvm_residency_msr {
u64 value;
bool delta_from_host;
bool count_with_host;
}

u64 kvm_residency_read_host(struct kvm_residency_msr *msr)
{
u64 unscaled_value = rdmsrl(msr->index);
// apply TSC scaling...
return ...
}

u64 kvm_residency_read(struct kvm_residency_msr *msr)
{
return msr->value +
(msr->delta_from_host ? kvm_residency_read_host(msr) : 0);
}

void kvm_residency_write(struct kvm_residency_msr *msr,
u64 value)
{
msr->value = value -
(msr->delta_from_host ? kvm_residency_read_host(msr) : 0);
}

// count_with_host is true for C1 iff any of KVM_CAP_DISABLE_EXITS_HLT
// or KVM_CAP_DISABLE_EXITS_MWAIT is set
// count_with_host is true for C3/C6/C7 iff KVM_CAP_DISABLE_EXITS_MWAIT
is set
void kvm_residency_setup(struct kvm_residency_msr *msr, u16 index,
bool count_with_host)
{
/* Preserve value on calls after the first */
u64 value = msr->index ? kvm_residency_read(msr) : 0;
msr->delta_from_host = msr->count_with_host = count_with_host;
msr->index = index;
kvm_residency_write(msr, value);
}

// The following functions are called from hlt vmexits.

void kvm_residency_start_hlt(struct kvm_residency_msr *msr)
{
if (msr->count_with_host) {
WARN_ON(msr->delta_from_host);
msr->value += kvm_residency_read_host(msr);
msr->delta_from_host = false;
}
}

// host_tsc_waited is 0 except for MSR_CORE_C1_RES
void kvm_residency_end_hlt(struct kvm_residency_msr *msr,
u64 host_tsc_waited)
{
if (msr->count_with_host) {
WARN_ON(!msr->delta_from_host);
msr->value -= kvm_residency_read_host(msr);
msr->delta_from_host = true;
}
if (host_tsc_waited) {
// ... apply TSC scaling to host_tsc_waited ...
msr->value += ...;
}
}

Thanks,

Paolo