Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

From: Sean Christopherson
Date: Wed May 01 2024 - 16:36:46 EST


On Wed, May 01, 2024, Mingwei Zhang wrote:
> On Mon, Apr 29, 2024 at 10:44 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> > > That's ok. It is about opinions and brainstorming. Adding a parameter
> > > to disable preemption is from the cloud usage perspective. The
> > > conflict of opinions is which one you prioritize: the guest PMU or the
> > > host PMU? If you stand on the guest vPMU usage perspective, do you
> > > want anyone on the host to fire off a profiling command and generate
> > > turbulence? No. If you stand on the host PMU perspective and you want
> > > to profile the VMM/KVM, you definitely want accuracy and no delay at all.
> >
> > Hard no from me. Attempting to support two fundamentally different models means
> > twice the maintenance burden. The *best* case scenario is that usage is roughly
> > a 50/50 split. The worst case scenario is that the majority of users favor one
> > model over the other, thus resulting in extremely limited testing of the minority
> > model.
> >
> > KVM already has this problem with scheduler preemption models, and it's painful.
> > The overwhelming majority of KVM users run non-preemptible kernels, and so our
> > test coverage for preemptible kernels is abysmal.
> >
> > E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
> > unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
> > preemption models resulted in users running code that was supposed to be specific
> > to preemptible kernels.
> >
> > [* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@xxxxxxxxxxx
> >
>
> I hear your voice, Sean.
>
> In our cloud, we have host-level profiling going on for all cores
> periodically. It will be profiling X seconds every Y minutes. Having
> the host-level profiling use exclude_guest is fine, but stopping the
> host-level profiling is a no-no. Tweaking X and Y is theoretically
> possible, but highly likely out of the scope of virtualization. Now,
> some of the VMs might be actively using the vPMU at the same time. How
> can we properly ensure the guest vPMU has consistent performance,
> instead of letting the VM suffer from the high overhead of the PMU for
> X seconds of every Y minutes?
>
> Any thoughts/help are appreciated. I see the logic of having preemption
> there for the correctness of host-level profiling. Doing this,
> however, negatively impacts the business usage described above.
>
> One of the things at the top of my mind is that there seems to be no way
> for the perf subsystem to express this: "no, your host-level profiling
> is not interested in profiling the KVM_RUN loop when our guest vPMU is
> actively running".

For good reason, IMO. The KVM_RUN loop can reach _far_ outside of KVM, especially
when IRQs and NMIs are involved. I don't think anyone can reasonably say that
profiling is never interested in what happens while a task is in KVM_RUN. E.g. if
there's a bottleneck in some memory allocation flow that happens to be triggered
in the greater KVM_RUN loop, that's something we'd want to show up in our profiling
data.

And if our systems are properly configured, for VMs with a mediated/passthrough
PMU, 99.99999% of their associated pCPUs' time should be spent in KVM_RUN. If
that's our reality, what's the point of profiling if KVM_RUN is out of scope?

We could make the context switching logic more sophisticated, e.g. trigger a
context switch when control leaves KVM, a la the ASI concepts, but that's all but
guaranteed to be overkill, and would have a very high maintenance cost.

But we can likely get what we want (low observed overhead from the guest) while
still context switching PMU state in vcpu_enter_guest(). KVM already handles the
hottest VM-Exit reasons in its fastpath, i.e. without triggering a PMU context
switch. For a variety of reasons, I think we should be more aggressive and handle
more VM-Exits in the fastpath, e.g. I can't think of any reason KVM can't handle
fast page faults in the fastpath.
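
To make that concrete: the fastpath is conceptually just a switch on the exit
reason that fully handles a few hot, trivial exits and punts everything else to
the full exit handlers. Rough sketch of the shape of it, with simplified/made-up
helper names (see vmx_exit_handlers_fastpath() for the real thing):

	static fastpath_t handle_exit_fastpath(struct kvm_vcpu *vcpu, u32 exit_reason)
	{
		switch (exit_reason) {
		case EXIT_REASON_MSR_WRITE:
			/* E.g. a guest write to the TSC deadline timer MSR. */
			return handle_fastpath_wrmsr(vcpu);
		case EXIT_REASON_PREEMPTION_TIMER:
			return handle_fastpath_preemption_timer(vcpu);
		default:
			/*
			 * Everything else takes the slow path, i.e. a full exit
			 * and, with a mediated PMU, a PMU context switch.
			 */
			return EXIT_FASTPATH_NONE;
		}
	}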

If we handle the overwhelming majority of VM-Exits in the fastpath when the guest
is already booted, e.g. when vCPUs aren't taking a high number of "slow" VM-Exits,
then the fact that slow VM-Exits trigger a PMU context switch should be a non-issue,
because taking a slow exit would be a rare operation.

I.e. rather than solving the overhead problem by moving around the context switch
logic, solve the problem by moving KVM code inside the "guest PMU" section. It's
essentially a different way of doing the same thing, with the critical difference
being that only hand-selected flows are excluded from profiling, i.e. only the
flows that need to be blazing fast and should be uninteresting from a profiling
perspective are excluded.
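
In sketch form, the PMU swap stays in vcpu_enter_guest() and brackets the inner
run loop, so fastpath exits never leave the "guest PMU" section. Completely
untested, and the kvm_mediated_pmu_{load,put}() and vcpu_run() names below are
made up purely for illustration:

	/* Guest PMU state loaded, host perf counters stopped from here on. */
	kvm_mediated_pmu_load(vcpu);

	for (;;) {
		/* Vendor VM-Enter/VM-Exit hook, including fastpath handling. */
		exit_fastpath = vcpu_run(vcpu);
		if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
			break;

		/* Fastpath-handled exit: go straight back into the guest. */
	}

	/*
	 * Only "slow" exits get here and pay for the PMU swap; the full exit
	 * handlers, and anything they call, stay visible to host profiling.
	 */
	kvm_mediated_pmu_put(vcpu);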