Re: [PATCH v2] KVM: X86: Emulate APERF/MPERF to report actual vCPU frequency

From: Jim Mattson
Date: Wed Dec 29 2021 - 21:36:48 EST


On Wed, Dec 29, 2021 at 4:28 PM Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>
> On Tue, Dec 28, 2021 at 8:06 PM Like Xu <like.xu.linux@xxxxxxxxx> wrote:
> >
> > Hi Jim,
> >
> > Thanks for your detailed comments.
> >
> > On 29/12/2021 9:11 am, Jim Mattson wrote:
> > > On Wed, Dec 22, 2021 at 5:34 AM Like Xu <like.xu.linux@xxxxxxxxx> wrote:
> > >>
> > >> From: Like Xu <likexu@xxxxxxxxxxx>
> > >>
> > >> The aperf/mperf msrs are used to report the current CPU frequency after 7d5905dc14a.
> > >> But the guest kernel always reports a fixed vCPU frequency in /proc/cpuinfo,
> > >> which may confuse users, especially when turbo is enabled on the host or
> > >> when the vCPU has a noisy, high-power-consumption neighbour task.
> > >>
> > >> Most guests such as Linux only issue read accesses to the AMPERF msrs, so
> > >> we can pass the registers through to the vcpu as a fast path (a performance win);
> > >> once any write access is trapped, the emulation is switched to the slow path,
> > >> which emulates the guest APERF/MPERF values based on the host values.
> > >> In emulation mode, the returned MPERF msr value is scaled according
> > >> to the TSCRatio value.
> > >>
> > >> As a minimal effort, KVM exposes the AMPERF feature only when the host TSC
> > >> has the CONSTANT and NONSTOP features, to avoid the need for more code
> > >> to cover the various corner cases arising from host power throttling transitions.
> > >>
> > >> The slow path code reveals an opportunity to refactor update_vcpu_amperf()
> > >> and get_host_amperf() to be more flexible and generic, to cover more
> > >> power-related msrs.
> > >>
> > >> Requested-by: Dongli Cao <caodongli@xxxxxxxxxxxx>
> > >> Requested-by: Li RongQing <lirongqing@xxxxxxxxx>
> > >> Signed-off-by: Like Xu <likexu@xxxxxxxxxxx>
> > >> ---
> > >> v1 -> v2 Changelog:
> > >> - Use MSR_TYPE_R to passthrough as a fast path;
> > >> - Use [svm|vmx]_set_msr for emulation as a slow path;
> > >> - Interact MPERF with TSC scaling (Jim Mattson);
> > >> - Drop bool hw_coord_fb_cap with cpuid check;
> > >> - Add TSC CONSTANT and NONSTOP cpuid check;
> > >> - Duplicate static_call(kvm_x86_run) to make the branch predictor happier;
> > >>
> > >> Previous:
> > >> https://lore.kernel.org/kvm/20200623063530.81917-1-like.xu@xxxxxxxxxxxxxxx/
> > >>
> > >> arch/x86/include/asm/kvm_host.h | 12 +++++
> > >> arch/x86/kvm/cpuid.c | 3 ++
> > >> arch/x86/kvm/cpuid.h | 22 +++++++++
> > >> arch/x86/kvm/svm/svm.c | 15 ++++++
> > >> arch/x86/kvm/svm/svm.h | 2 +-
> > >> arch/x86/kvm/vmx/vmx.c | 18 ++++++-
> > >> arch/x86/kvm/x86.c | 85 ++++++++++++++++++++++++++++++++-
> > >> 7 files changed, 153 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > >> index ce622b89c5d8..1cad3992439e 100644
> > >> --- a/arch/x86/include/asm/kvm_host.h
> > >> +++ b/arch/x86/include/asm/kvm_host.h
> > >> @@ -39,6 +39,8 @@
> > >>
> > >> #define KVM_MAX_VCPUS 1024
> > >>
> > >> +#define KVM_MAX_NUM_HWP_MSR 2
> > >> +
> > >> /*
> > >> * In x86, the VCPU ID corresponds to the APIC ID, and APIC IDs
> > >> * might be larger than the actual number of VCPUs because the
> > >> @@ -562,6 +564,14 @@ struct kvm_vcpu_hv_stimer {
> > >> bool msg_pending;
> > >> };
> > >>
> > >> +/* vCPU thermal and power context */
> > >> +struct kvm_vcpu_hwp {
> > >> + bool fast_path;
> > >> + /* [0], APERF msr, increases with the current/actual frequency */
> > >> + /* [1], MPERF msr, increases with a fixed frequency */
> > >
> > > According to the SDM, volume 3, section 18.7.2,
> > > * The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at close to the
> > > maximum non-turbo frequency, which is equal to the product of scalable
> > > bus frequency and maximum non-turbo ratio.
> >
> > For AMD, it will be the P0 frequency.
> >
> > >
> > > It's important to note that IA32_MPERF operates at close to the same
> > > frequency of the TSC. If that were not the case, your comment
> > > regarding IA32_APERF would be incorrect.
> >
> > Yes, how does this look:
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f8f978bc9ec3..d422bf8669ca 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -568,7 +568,7 @@ struct kvm_vcpu_hv_stimer {
> > struct kvm_vcpu_hwp {
> > bool fast_path;
> > /* [0], APERF msr, increases with the current/actual frequency */
> > - /* [1], MPERF msr, increases with a fixed frequency */
> > + /* [1], MPERF msr, increases at the same fixed frequency as the TSC */
> > u64 msrs[KVM_MAX_NUM_HWP_MSR];
> > };
>
> That looks fine from the Intel perspective. (Note that I have not
> looked at AMD's documentation yet.)
>
> > >
> > > For example, suppose that the TSC frequency were 2.0 GHz, the
> > > current/actual frequency were 2.2 GHz, and the IA32_MPERF frequency
> > > were 133 MHz. In that case, the IA32_APERF MSR would increase at 146.3
> > > MHz.
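
(That 146.3 MHz follows from IA32_APERF scaling with the actual
frequency relative to the TSC frequency: 133 MHz * 2.2 GHz / 2.0 GHz
~= 146.3 MHz.)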
> > >
> >
> > >> + u64 msrs[KVM_MAX_NUM_HWP_MSR];
> > >> +};
> > >> +
> > >> /* Hyper-V synthetic interrupt controller (SynIC)*/
> > >> struct kvm_vcpu_hv_synic {
> > >> u64 version;
> > >> @@ -887,6 +897,8 @@ struct kvm_vcpu_arch {
> > >> /* AMD MSRC001_0015 Hardware Configuration */
> > >> u64 msr_hwcr;
> > >>
> > >> + struct kvm_vcpu_hwp hwp;
> > >> +
> > >> /* pv related cpuid info */
> > >> struct {
> > >> /*
> > >> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > >> index 0b920e12bb6d..e20e5e8c2b3a 100644
> > >> --- a/arch/x86/kvm/cpuid.c
> > >> +++ b/arch/x86/kvm/cpuid.c
> > >> @@ -739,6 +739,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> > >> entry->eax = 0x4; /* allow ARAT */
> > >> entry->ebx = 0;
> > >> entry->ecx = 0;
> > >> + /* allow aperf/mperf to report the true vCPU frequency. */
> > >> + if (kvm_cpu_cap_has_amperf())
> > >> + entry->ecx |= (1 << 0);
> > >> entry->edx = 0;
> > >> break;
> > >> /* function 7 has additional index. */
> > >> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> > >> index c99edfff7f82..741949b407b7 100644
> > >> --- a/arch/x86/kvm/cpuid.h
> > >> +++ b/arch/x86/kvm/cpuid.h
> > >> @@ -154,6 +154,28 @@ static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu)
> > >> return x86_stepping(best->eax);
> > >> }
> > >>
> > >> +static inline bool kvm_cpu_cap_has_amperf(void)
> > >> +{
> > >> + return boot_cpu_has(X86_FEATURE_APERFMPERF) &&
> > >> + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC);
> > >> +}
> > >> +
> > >> +static inline bool guest_support_amperf(struct kvm_vcpu *vcpu)
> > >> +{
> > >> + struct kvm_cpuid_entry2 *best;
> > >> +
> > >> + if (!kvm_cpu_cap_has_amperf())
> > >> + return false;
> > >> +
> > >> + best = kvm_find_cpuid_entry(vcpu, 0x6, 0);
> > >> + if (!best || !(best->ecx & 0x1))
> > >> + return false;
> > >> +
> > >> + best = kvm_find_cpuid_entry(vcpu, 0x80000007, 0);
> > >> + return best && (best->edx & (1 << 8));
> > > Nit: Use BIT().
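
(i.e., presumably: return best && (best->edx & BIT(8));)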
> >
> > Applied.
> >
> > >> +}
> > >> +
> > >> static inline bool guest_has_spec_ctrl_msr(struct kvm_vcpu *vcpu)
> > >> {
> > >> return (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
> > >> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > >> index 5557867dcb69..2873c7f132bd 100644
> > >> --- a/arch/x86/kvm/svm/svm.c
> > >> +++ b/arch/x86/kvm/svm/svm.c
> > >> @@ -114,6 +114,8 @@ static const struct svm_direct_access_msrs {
> > >> { .index = MSR_EFER, .always = false },
> > >> { .index = MSR_IA32_CR_PAT, .always = false },
> > >> { .index = MSR_AMD64_SEV_ES_GHCB, .always = true },
> > >> + { .index = MSR_IA32_MPERF, .always = false },
> > >> + { .index = MSR_IA32_APERF, .always = false },
> > >> { .index = MSR_INVALID, .always = false },
> > >> };
> > >>
> > >> @@ -1218,6 +1220,12 @@ static inline void init_vmcb_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >> /* No need to intercept these MSRs */
> > >> set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
> > >> set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
> > >> +
> > >> + if (guest_support_amperf(vcpu)) {
> > >> + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_MPERF, 1, 0);
> > >> + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_APERF, 1, 0);
> > >> + vcpu->arch.hwp.fast_path = true;
> > >> + }
> > >> }
> > >> }
> > >>
> > >> @@ -3078,6 +3086,13 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> > >> svm->msr_decfg = data;
> > >> break;
> > >> }
> > >> + case MSR_IA32_APERF:
> > >> + case MSR_IA32_MPERF:
> > >> + if (vcpu->arch.hwp.fast_path) {
> > >> + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_MPERF, 0, 0);
> > >> + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_APERF, 0, 0);
> > >> + }
> > >> + return kvm_set_msr_common(vcpu, msr);
> > >> default:
> > >> return kvm_set_msr_common(vcpu, msr);
> > >> }
> > >> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > >> index 9f153c59f2c8..ad4659811620 100644
> > >> --- a/arch/x86/kvm/svm/svm.h
> > >> +++ b/arch/x86/kvm/svm/svm.h
> > >> @@ -27,7 +27,7 @@
> > >> #define IOPM_SIZE PAGE_SIZE * 3
> > >> #define MSRPM_SIZE PAGE_SIZE * 2
> > >>
> > >> -#define MAX_DIRECT_ACCESS_MSRS 20
> > >> +#define MAX_DIRECT_ACCESS_MSRS 22
> > >> #define MSRPM_OFFSETS 16
> > >> extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
> > >> extern bool npt_enabled;
> > >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > >> index 1d53b8144f83..8998042107d2 100644
> > >> --- a/arch/x86/kvm/vmx/vmx.c
> > >> +++ b/arch/x86/kvm/vmx/vmx.c
> > >> @@ -576,6 +576,9 @@ static bool is_valid_passthrough_msr(u32 msr)
> > >> case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
> > >> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> > >> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> > >> + case MSR_IA32_MPERF:
> > >> + case MSR_IA32_APERF:
> > >> + /* AMPERF MSRs. These are passthrough when all access is read-only. */
> > >
> > > Even if all accesses are read-only, these MSRs cannot be pass-through
> > > when the 'Use TSC scaling' VM-execution control is set and the TSC
> > > multiplier is anything other than 1.
> >
> > If all accesses are read-only, rdmsr will not be trapped and in that case:
> >
> > The value read is scaled by the TSCRatio value (MSR C000_0104h) for
> > guest reads, but the underlying counters are not affected. Reads in host
> > mode or writes to MPERF are not affected. [AMD APM 17.3.2]
>
> It's nice of AMD to scale reads of IA32_MPERF. That certainly
> simplifies the problem of virtualizing these MSRs. However, Intel is
> not so kind.
>
> > >
> > > Suppose, for example, that the vCPU has a TSC frequency of 2.2 GHz,
> > > but it is running on a host with a TSC frequency of 2.0 GHz. The
> > > effective IA32_MPERF frequency should be the same as the vCPU TSC
> > > frequency (scaled by the TSC multiplier), rather than the host
> > > IA32_MPERF frequency.
> >
> > I guess that Intel's implementation will also apply the effect of
> > the TSC multiplier for guest reads. Please let me know if I'm wrong.
>
> From the description of the "Use TSC scaling" VM-execution control in
> Table 23-7: "This control determines whether executions of RDTSC,
> executions of RDTSCP, and executions of RDMSR that read from the
> IA32_TIME_STAMP_COUNTER MSR return a value modified by the TSC
> multiplier field (see Section 23.6.5 and Section 24.3)."
>
> If you want to scale guest reads of IA32_MPERF, you will have to
> intercept them and perform the scaling in software.
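
To make that concrete, a rough sketch (mine, not from the patch; the
function name is made up) of what the intercepted read path would have
to do on Intel, reusing the same kvm_scale_tsc() helper the patch
already calls, and ignoring the accumulation/migration issues discussed
further down:

static u64 emulate_guest_mperf_read(struct kvm_vcpu *vcpu)
{
        u64 host_mperf;

        /* Hardware will not scale this read for the guest on Intel. */
        rdmsrl(MSR_IA32_MPERF, host_mperf);

        /* Apply the guest/host TSC ratio in software. */
        return kvm_scale_tsc(vcpu, host_mperf, vcpu->arch.tsc_scaling_ratio);
}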
>
> > >
> > >> return true;
> > >> }
> > >>
> > >> @@ -2224,7 +2227,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >> }
> > >> ret = kvm_set_msr_common(vcpu, msr_info);
> > >> break;
> > >> -
> > >> + case MSR_IA32_APERF:
> > >> + case MSR_IA32_MPERF:
> > >> + if (vcpu->arch.hwp.fast_path) {
> > >> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_RW, true);
> > >> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_RW, true);
> > >> + }
> > >> + ret = kvm_set_msr_common(vcpu, msr_info);
> > >> + break;
> > >> default:
> > >> find_uret_msr:
> > >> msr = vmx_find_uret_msr(vmx, msr_index);
> > >> @@ -6928,6 +6938,12 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> > >> vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
> > >> }
> > >>
> > >> + if (guest_support_amperf(vcpu)) {
> > >> + vmx_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
> > >> + vmx_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
> > >> + vcpu->arch.hwp.fast_path = true;
> > >> + }
> > >> +
> > >> vmx->loaded_vmcs = &vmx->vmcs01;
> > >>
> > >> if (cpu_need_virtualize_apic_accesses(vcpu)) {
> > >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > >> index 42bde45a1bc2..7a6355815493 100644
> > >> --- a/arch/x86/kvm/x86.c
> > >> +++ b/arch/x86/kvm/x86.c
> > >> @@ -1376,6 +1376,8 @@ static const u32 msrs_to_save_all[] = {
> > >> MSR_F15H_PERF_CTL3, MSR_F15H_PERF_CTL4, MSR_F15H_PERF_CTL5,
> > >> MSR_F15H_PERF_CTR0, MSR_F15H_PERF_CTR1, MSR_F15H_PERF_CTR2,
> > >> MSR_F15H_PERF_CTR3, MSR_F15H_PERF_CTR4, MSR_F15H_PERF_CTR5,
> > >> +
> > >> + MSR_IA32_APERF, MSR_IA32_MPERF,
> > >> };
> > >>
> > >> static u32 msrs_to_save[ARRAY_SIZE(msrs_to_save_all)];
> > >> @@ -3685,6 +3687,16 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >> return 1;
> > >> vcpu->arch.msr_misc_features_enables = data;
> > >> break;
> > >> + case MSR_IA32_APERF:
> > >> + case MSR_IA32_MPERF:
> > >> + /* Ignore meaningless value overrides from user space.*/
> > >> + if (msr_info->host_initiated)
> > >> + return 0;
> > >
> > > Without these meaningless overrides from userspace, how do we ensure
> > > that the guest derives the correct IA32_APERF/IA32_MPERF ratio for a
> >
> > The guest cares about the ratio of the two deltas rather than the APERF/MPERF ratio.
> >
> > Effective frequency = {(APERF − APERF_INIT) / (MPERF − MPERF_INIT)} * P0 frequency
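
As an aside, a minimal illustration of that delta-based calculation (my
sketch, not the guest's actual code; div64_u64() is just the kernel's
64-bit division helper, and the intermediate multiply can overflow for
very large deltas):

static u64 effective_freq_khz(u64 aperf_delta, u64 mperf_delta, u64 p0_khz)
{
        /* No reference cycles elapsed; fall back to the base frequency. */
        if (!mperf_delta)
                return p0_khz;

        /* Effective frequency = P0 * (delta APERF / delta MPERF). */
        return div64_u64(p0_khz * aperf_delta, mperf_delta);
}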
>
> My question was, "How do you ensure the deltas are correct when
> APERF_INIT and MPERF_INIT are sampled before live migration and APERF
> and MPERF are sampled after live migration?" (Using your equation
> above.)
>
> > > set of measurements that span a live migration? For that matter, how
> > > do we ensure that the deltas are even positive?
> >
> > Once we allow user space to restore AMPERF msr values that differ from
> > the host values, the slow path will be taken, and we try to avoid that kind
> > of case due to the overhead, whether for live migration or pCPU migration.
>
> Nonetheless, your implementation does not work.
>
> > >
> > > For example, suppose that the VM has migrated from a host with an
> > > IA32_MPERF value of 0x0000123456789abc to a host with an IA32_MPERF
> > > value of 0x000000123456789a. If the guest sampled IA32_MPERF before
> > > and after live migration, it would see the counter go backwards, which
> >
> > Yes, it will happen, since without more hints from KVM, user space
> > can't be sure whether the save/restore falls within an AMPERF sampling period.
> > Even worse, the guest could manipulate the order in which it reads the AMPERF msrs.
> >
> > The proposal is to *let it happen* because it causes no harm; what the guest
> > really cares about is the ratio of the deltas, not the accuracy of the
> > individual msr values, and if the result from one sample is ridiculous, the guest
> > should discard it and pick up the result from the next sample.
>
> You do not get to define the architecture. The CPU vendors have
> already done that. Your job is to adhere to the architectural
> specification.
>
> > Maybe we could add fault tolerance for AMPERF in the guest, something like
> > a retry mechanism, or simply discarding extreme values using statistical methods.
>
> That sounds like a paravirtual approach to me. There is nothing in the
> architectural specification that suggests that such mechanisms are
> necessary.
>
> > The good news is that guests like Linux are already fairly robust to this issue.
> > (See 9a6c2c3c7a73ce315c57c1b002caad6fcc858d0f and related work.)
> >
> > Considering that AMPERF sampling is relatively frequent compared with the
> > workload runtime, and that it statistically reports the right vCPU frequency,
> > do you think this proposal (treating such overrides as meaningless) is
> > acceptable or practicable?
>
> My opinion is that your proposal is unacceptable, but I am not a decision maker.
>
> > > should not happen.
> > >
> > >> + if (!guest_support_amperf(vcpu))
> > >> + return 1;
> > >> + vcpu->arch.hwp.msrs[MSR_IA32_APERF - msr] = data;
> > >> + vcpu->arch.hwp.fast_path = false;
> > >> + break;
> > >> default:
> > >> if (kvm_pmu_is_valid_msr(vcpu, msr))
> > >> return kvm_pmu_set_msr(vcpu, msr_info);
> > >> @@ -4005,6 +4017,17 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >> case MSR_K7_HWCR:
> > >> msr_info->data = vcpu->arch.msr_hwcr;
> > >> break;
> > >> + case MSR_IA32_APERF:
> > >> + case MSR_IA32_MPERF: {
> > >> + u64 value;
> > >> +
> > >> + if (!msr_info->host_initiated && !guest_support_amperf(vcpu))
> > >> + return 1;
> > >> + value = vcpu->arch.hwp.msrs[MSR_IA32_APERF - msr_info->index];
> > >> + msr_info->data = (msr_info->index == MSR_IA32_APERF) ? value :
> > >> + kvm_scale_tsc(vcpu, value, vcpu->arch.tsc_scaling_ratio);
> > >
> > > I think it makes more sense to perform the scaling before storing the
> > > IA32_MPERF value in vcpu->arch.hwp.msrs[].
> >
> > Emm, do we really need to add more instruction cycles to each call
> > of update_vcpu_amperf() on the critical path vcpu_enter_guest(), given that the
> > calls to kvm_get_msr_common() are relatively sparse?
>
> One possible alternative may be for kvm to take over the IA32_MPERF
> and IA32_APERF MSRs on sched-in. That may result in less overhead.
>
> > Will we get a functional error if we defer the kvm_scale_tsc() operation?
>
> If you accumulate IA32_MPERF cycles from multiple hosts with different
> IA32_MPERF frequencies and you defer the kvm_scale_tsc operation,
> then, yes, this is broken.
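
In other words (a sketch of the alternative, untested), the scaling
would be applied when each per-run delta is folded into the guest
value, so that everything stored in vcpu->arch.hwp.msrs[] is already in
guest units:

        /* slow path, after a vmexit */
        mdelta = get_amperf_delta(before[1], after[1]);
        mdelta = kvm_scale_tsc(vcpu, mdelta, vcpu->arch.tsc_scaling_ratio);
        vcpu->arch.hwp.msrs[1] += mdelta;       /* overflow handling omitted */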
>
> > >
> > >> + break;
> > >> + }
> > >> default:
> > >> if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
> > >> return kvm_pmu_get_msr(vcpu, msr_info);
> > >> @@ -9688,6 +9711,53 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
> > >> }
> > >> EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
> > >>
> > >> +static inline void get_host_amperf(u64 msrs[])
> > >> +{
> > >> + rdmsrl(MSR_IA32_APERF, msrs[0]);
> > >> + rdmsrl(MSR_IA32_MPERF, msrs[1]);
> > >> +}
> > >> +
> > >> +static inline u64 get_amperf_delta(u64 enter, u64 exit)
> > >> +{
> > >> + if (likely(exit >= enter))
> > >> + return exit - enter;
> > >> +
> > >> + return ULONG_MAX - enter + exit;
> > >> +}
> > >> +
> > >> +static inline void update_vcpu_amperf(struct kvm_vcpu *vcpu, u64 adelta, u64 mdelta)
> > >> +{
> > >> + u64 aperf_left, mperf_left, delta, tmp;
> > >> +
> > >> + aperf_left = ULONG_MAX - vcpu->arch.hwp.msrs[0];
> > >> + mperf_left = ULONG_MAX - vcpu->arch.hwp.msrs[1];
> > >> +
> > >> + /* Fast path when neither MSR overflows */
> > >> + if (adelta <= aperf_left && mdelta <= mperf_left) {
> > >> + vcpu->arch.hwp.msrs[0] += adelta;
> > >> + vcpu->arch.hwp.msrs[1] += mdelta;
> > >> + return;
> > >> + }
> > >> +
> > >> + /* When either MSR overflows, both MSRs are reset to zero and continue to increment. */
> > >> + delta = min(adelta, mdelta);
> > >> + if (delta > aperf_left || delta > mperf_left) {
> > >> + tmp = max(vcpu->arch.hwp.msrs[0], vcpu->arch.hwp.msrs[1]);
> > >> + tmp = delta - (ULONG_MAX - tmp) - 1;
> > >> + vcpu->arch.hwp.msrs[0] = tmp + adelta - delta;
> > >> + vcpu->arch.hwp.msrs[1] = tmp + mdelta - delta;
> > >> + return;
> > >> + }
> > >
> > > I don't believe that the math above is correct in the general case. It
> > > appears to assume that the counters are running at the same frequency.
> >
> > Are you saying that if the guest counter is not considered to be running
> > at the same frequency as the host's, we need to wrap mdelta with
> > kvm_scale_tsc() when accumulating the mdelta difference for each vmentry/exit?
>
> No. I just think your math/logic is wrong. Consider the following example:
>
> At time t0, IA32_MPERF is -1000, and IA32_APERF is -1999. At time t1,
> IA32_MPERF and IA32_APERF are both 1. Even assuming a constant CPU
> frequency between t0 and t1, the possible range of actual frequencies
> is from half the TSC frequency to double the TSC frequency,
> exclusive. If IA32_APERF is counting at just over half the TSC
> frequency, then IA32_MPERF will hit 0 first. In this case, at t1, the
> MPERF delta will be 1001, and the APERF delta will be ~502. However,
> if IA32_APERF is counting at just under double the TSC frequency, then
> IA32_APERF will hit 0 first, but just barely. In this case, at t1, the
> MPERF delta will be ~1000, and the APERF delta will be 2000.
>
> Your code only works in the latter case, where both IA32_APERF and
> IA32_MPERF hit 0 at the same time. The fundamental problem is the
> handling of the wrap-around in get_amperf_delta. You construct the
> wrap-around delta as if the counter went all the way to ULONG_MAX
> before being reset to 0, yet, we know that one of the counters is not
> likely to have made it that far.
>
> > > The whole point of this exercise is that the counters do not always
> > > run at the same frequency.
> > >
> > >> +
> > >> + if (mdelta > adelta && mdelta > aperf_left) {
> > >> + vcpu->arch.hwp.msrs[0] = 0;
> > >> + vcpu->arch.hwp.msrs[1] = mdelta - mperf_left - 1;
> > >> + } else {
> > >> + vcpu->arch.hwp.msrs[0] = adelta - aperf_left - 1;
> > >> + vcpu->arch.hwp.msrs[1] = 0;
> > >> + }
> > >
> > > I don't understand this code at all. It seems quite unlikely that you
> >
> > The value of each msr affects the other when either one overflows:
> >
> > * When either MSR overflows, both MSRs are reset to zero and
> > continue to increment. [Intel SDM, CHAPTER 14, 14.2]
> >
> > > are ever going to catch a wraparound at just the right point for one
> > > of the MSRs to be 0. Moreover, since the two counters are not counting
> > > the same thing, it doesn't seem likely that it would ever be correct
> > > to derive the guest's IA32_APERF value from IA32_MPERF or vice versa.
> > >
> > >> +}
> > >> +
> > >> /*
> > >> * Returns 1 to let vcpu_run() continue the guest execution loop without
> > >> * exiting to the userspace. Otherwise, the value will be returned to the
> > >> @@ -9700,7 +9770,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >> dm_request_for_irq_injection(vcpu) &&
> > >> kvm_cpu_accept_dm_intr(vcpu);
> > >> fastpath_t exit_fastpath;
> > >> -
> > >> + u64 before[2], after[2];
> > >> bool req_immediate_exit = false;
> > >>
> > >> /* Forbid vmenter if vcpu dirty ring is soft-full */
> > >> @@ -9942,7 +10012,16 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >> */
> > >> WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
> > >>
> > >> - exit_fastpath = static_call(kvm_x86_run)(vcpu);
> > >> + if (likely(vcpu->arch.hwp.fast_path)) {
> > >> + exit_fastpath = static_call(kvm_x86_run)(vcpu);
> > >> + } else {
> > >> + get_host_amperf(before);
> > >> + exit_fastpath = static_call(kvm_x86_run)(vcpu);
> > >> + get_host_amperf(after);
> > >> + update_vcpu_amperf(vcpu, get_amperf_delta(before[0], after[0]),
> > >> + get_amperf_delta(before[1], after[1]));
> > >> + }
> > >> +
> > > The slow path is awfully expensive here. Shouldn't there also be an
> > > option to do none of this, if the guest doesn't advertise CPUID.06H:
> > > ECX[0]?
> >
> > Yes, that sounds pretty good to me; let me figure it out.
>
> Your slow path seems fundamentally broken, in that IA32_MPERF only
> counts while the vCPU thread is running. It should count all of the
> time, just as the guest TSC does. For example, we offer a low-cost VM
> that is throttled to run at most 50% of the time. Anyone looking at
> the APERF/MPERF ratio for such a VM should see the 50% duty cycle
> reflected as IA32_APERF advancing at half the frequency of IA32_MPERF.
> However, if IA32_MPERF only advances when the vCPU thread is running,
> the apparent performance will be inflated by 2x.

Actually, your fast path is similarly broken, in that IA32_APERF
should only count while the vCPU is running (or at least scheduled).
As it stands, the 50% duty cycle VM will get an inflated APERF/MPERF
ratio using the fast path, because it will be credited APERF cycles
while it is descheduled and other tasks are running. Per section
14.5.5 of the SDM, "The IA32_APERF counter does not count during
forced idle state." A vCPU thread being descheduled is the virtual
machine equivalent of a logical processor being forced idle by HDC.
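
One way to get closer to those semantics (a sketch only, not a tested
design; the hwp fields used here are hypothetical) would be to charge
APERF only for the intervals the vCPU thread is actually scheduled,
while deriving MPERF from the guest TSC so that it keeps counting even
when the thread is descheduled:

static void hwp_vcpu_sched_in(struct kvm_vcpu *vcpu)
{
        /* Snapshot the host counter when the vCPU thread goes on-CPU. */
        rdmsrl(MSR_IA32_APERF, vcpu->arch.hwp.aperf_at_sched_in);
}

static void hwp_vcpu_sched_out(struct kvm_vcpu *vcpu)
{
        u64 aperf;

        rdmsrl(MSR_IA32_APERF, aperf);
        /* Credit only the cycles accumulated while scheduled (host
         * counter overflow ignored for brevity). */
        vcpu->arch.hwp.msrs[0] += aperf - vcpu->arch.hwp.aperf_at_sched_in;
}

static u64 hwp_guest_mperf(struct kvm_vcpu *vcpu)
{
        /* MPERF advances with the guest TSC, scheduled or not. */
        return vcpu->arch.hwp.mperf_base + kvm_read_l1_tsc(vcpu, rdtsc());
}

That still leaves the scaling and save/restore questions above, but at
least a 50% duty cycle would then show up in the APERF/MPERF ratio.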

>
> > >
> > >> if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
> > >> break;
> > >>
> > >> @@ -11138,6 +11217,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >> vcpu->arch.xcr0 = XFEATURE_MASK_FP;
> > >> }
> > >>
> > >> + memset(vcpu->arch.hwp.msrs, 0, sizeof(vcpu->arch.hwp.msrs));
> > >> +
> > >> /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
> > >> memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> > >> kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
> > >> --
> > >> 2.33.1
> > >>
> > >