Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

From: Sean Christopherson
Date: Wed Mar 02 2022 - 13:01:48 EST


On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > Zap defunct roots, a.k.a. roots that have been invalidated after their
> > last reference was initially dropped, asynchronously via the system work
> > queue instead of forcing the work upon the unfortunate task that happened
> > to drop the last reference.
> >
> > If a vCPU task drops the last reference, the vCPU is effectively blocked
> > by the host for the entire duration of the zap. If the root being zapped
> > happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> > being active, the zap can take several hundred seconds. Unsurprisingly,
> > most guests are unhappy if a vCPU disappears for hundreds of seconds.
> >
> > E.g. running a synthetic selftest that triggers a vCPU root zap with
> > ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> > Offloading the zap to a worker drops the block time to <100ms.
> >
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > ---
>
> Do we even need kvm_tdp_mmu_zap_invalidated_roots() now? That is,
> something like the following:

Nice! I initially did something similar (moving invalidated roots to a separate
list), but never circled back to the idea after implementing the worker stuff.

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd3625a875ef..5fd8bc858c6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> {
> lockdep_assert_held(&kvm->slots_lock);
> + /*
> + * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
> + * count. If we're dying, zap everything as it's going to happen
> + * soon anyway.
> + */
> + if (!refcount_read(&kvm->users_count)) {
> + kvm_mmu_zap_all(kvm);
> + return;
> + }

I'd prefer we make this an assertion and shove this logic into set_nx_huge_pages(),
because in that case there's no need to zap anything; the guest can never run
again. E.g. (I'm trying to remember why I didn't do this before...)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..d4d25ab88ae7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6132,7 +6132,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)

list_for_each_entry(kvm, &vm_list, vm_list) {
mutex_lock(&kvm->slots_lock);
- kvm_mmu_zap_all_fast(kvm);
+ if (refcount_read(&kvm->users_count))
+ kvm_mmu_zap_all_fast(kvm);
mutex_unlock(&kvm->slots_lock);

wake_up_process(kvm->arch.nx_lpage_recovery_thread);
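
To spell out the split I have in mind (caller filters out dying VMs, callee only
asserts), here is a minimal userspace sketch. Everything in it is hypothetical:
struct toy_vm, toy_zap_all_fast() and toy_zap_live_vms() are stand-ins rather
than KVM code, C11 atomics replace refcount_t, and slots_lock is left out
entirely.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct kvm; not the kernel's definition. */
struct toy_vm {
	atomic_int users_count;	/* models kvm->users_count */
	int zapped;		/* set once the fast zap has run */
};

/* Models kvm_mmu_zap_all_fast(): callers must only pass a live VM. */
static void toy_zap_all_fast(struct toy_vm *vm)
{
	/* The check is an assertion here, not a fallback path. */
	assert(atomic_load(&vm->users_count) != 0);
	vm->zapped = 1;
}

/* Models the set_nx_huge_pages() loop; returns the number of VMs zapped. */
static int toy_zap_live_vms(struct toy_vm *vms, int nr_vms)
{
	int i, nr_zapped = 0;

	for (i = 0; i < nr_vms; i++) {
		/* A VM with no users can never run again, so skip it. */
		if (!atomic_load(&vms[i].users_count))
			continue;
		toy_zap_all_fast(&vms[i]);
		nr_zapped++;
	}
	return nr_zapped;
}
```

The idea is that the dying-VM case does no work at all, instead of falling back
to a full zap inside the callee.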


> +
> write_lock(&kvm->mmu_lock);
> trace_kvm_mmu_zap_all_fast(kvm);
> @@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_zap_obsolete_pages(kvm);
> write_unlock(&kvm->mmu_lock);
> -
> - /*
> - * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> - * returning to the caller, e.g. if the zap is in response to a memslot
> - * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> - * associated with the deleted memslot once the update completes, and
> - * Deferring the zap until the final reference to the root is put would
> - * lead to use-after-free.
> - */
> - if (is_tdp_mmu_enabled(kvm)) {
> - read_lock(&kvm->mmu_lock);
> - kvm_tdp_mmu_zap_invalidated_roots(kvm);
> - read_unlock(&kvm->mmu_lock);
> - }
> }
> static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cd1bf68e7511..af9db5b8f713 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> WARN_ON(!root->tdp_mmu_page);
> /*
> - * The root now has refcount=0 and is valid. Readers cannot acquire
> - * a reference to it (they all visit valid roots only, except for
> - * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
> - * any reference itself.
> + * The root now has refcount=0. It is valid, but readers already
> + * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> + * rejects it. This remains true for the rest of the execution
> + * of this function, because readers visit valid roots only

One thing that keeps tripping me up is the "readers" verbiage. I get confused
because taking mmu_lock for read vs. write doesn't really have anything to do with
reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
"readers" means anything iterating over the set of roots. Not sure if there's a
shorthand that won't be confusing.

> + * (except for tdp_mmu_zap_root_work(), which however operates only
> + * on one specific root and does not acquire any reference itself).
>
> *
> * Even though there are flows that need to visit all roots for
> * correctness, they all take mmu_lock for write, so they cannot yet

...

> It passes a smoke test, and also resolves the debate on the fate of patch 1.

+1000, I love this approach. Do you want me to work on a v3, or shall I let you
have the honors?