Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker

From: Sean Christopherson
Date: Fri Mar 04 2022 - 19:34:39 EST


On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/4/22 17:02, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > I didn't remove the paragraph from the commit message, but I think it's
> > > unnecessary now. The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > a reference to the VM.
> > >
> > > I think I don't even need to check kvm->users_count in the defunct root
> > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > before it checks that the lists are empty.
> >
> > Yes, that should work. IIRC, the WARN_ONs will tell us/you quite quickly if
> > we're wrong :-) mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > the WARN_ON that there are no roots on the list.
>
> Good, for the record these are the commit messages I have:
>
> KVM: x86/mmu: Zap invalidated roots via asynchronous worker
> Use the system worker threads to zap the roots invalidated
> by the TDP MMU's "fast zap" mechanism, implemented by
> kvm_tdp_mmu_invalidate_all_roots().
> At this point, apart from allowing some parallelism in the zapping of
> roots, the workqueue is a glorified linked list: work items are added and
> flushed entirely within a single kvm->slots_lock critical section. However,
> the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
> assumes that it owns a reference to all invalid roots; therefore, no
> one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the
> invalidated roots on a linked list... erm, on a workqueue ensures that
> tdp_mmu_zap_root_work() only puts back those extra references that
> kvm_mmu_zap_all_invalidated_roots() had gifted to it.
>
> and
>
> KVM: x86/mmu: Zap defunct roots via asynchronous worker
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the existing work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds. Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
> There is an important nuance to this change. If the same work item
> was queued twice before the work function has run, it would only
> execute once and one reference would be leaked. Therefore, now that
> queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
> kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
> skip already invalid roots. On the other hand, kvm_mmu_zap_all_fast()
> must return only after those skipped roots have been zapped as well.
> These two requirements can be satisfied only if _all_ places that
> change invalid to true now schedule the worker before releasing the
> mmu_lock. There are just two, kvm_tdp_mmu_put_root() and
> kvm_tdp_mmu_invalidate_all_roots().

Very nice!