Re: [PATCH v4 12/12] KVM: x86/mmu: convert kvm_zap_gfn_range() to use shared mmu_lock in TDP MMU

From: Yan Zhao
Date: Tue Sep 05 2023 - 21:18:16 EST


On Tue, Sep 05, 2023 at 03:31:59PM -0700, Sean Christopherson wrote:
> On Mon, Sep 04, 2023, Yan Zhao wrote:
> > On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> > > On Fri, Jul 14, 2023, Yan Zhao wrote:
> > > > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> > > > read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> > > > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> > > > by RCU lock.
> > > >
> > > > GFN zap can be super slow if mmu_lock is held for write when there are
> > > > contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> > > > GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > > > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> > > > for every GFN.
> > > > Contentions can either from concurrent zaps holding mmu_lock for write or
> > > > from tdp_mmu_map() holding mmu_lock for read.
> > >
> > > The lock contention should go away with a pre-check[*], correct? That's a more
> > Yes, I think so, though I don't have time to verify it yet.
> >
> > > complete solution too, in that it also avoids lock contention for the shadow MMU,
> > > which presumably suffers the same problem (I don't see anything that would prevent
> > > it from yielding).
> > >
> > > If we do want to zap with mmu_lock held for read, I think we should convert
> > > kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> > > missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> > > the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> > > tasks will race to install SPTEs that are supposed to be zapped.
> > Yes. I did't do that to the unmap path was only because I don't want to make a
> > big code change.
> > The write lock in kvm_unmap_gfn_range() path is taken in arch-agnostic code,
> > which is not easy to change, right?
>
> Yeah. The lock itself isn't bad, especially if we can convert all mmu_nofitier
> hooks, e.g. we already have KVM_MMU_LOCK(), adding a variant for mmu_notifiers
> would be quite easy.
>
> The bigger problem would be kvm_mmu_invalidate_{begin,end}() and getting the
> memory ordering right, especially if there are multiple mmu_notifier events in
> flight.
>
> But I was actually thinking of a cheesier approach: drop and reacquire mmu_lock
> when zapping, e.g. without the necessary changes in tdp_mmu_zap_leafs():
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 735c976913c2..c89a2511789b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -882,9 +882,15 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> {
> struct kvm_mmu_page *root;
>
> + write_unlock(&kvm->mmu_lock);
> + read_lock(&kvm->mmu_lock);
> +
> for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
>
> + read_unlock(&kvm->mmu_lock);
> + write_lock(&kvm->mmu_lock);
> +
> return flush;
> }
>
> vCPUs would still get blocked, but for a smaller duration, and the lock contention
> between vCPUs and the zapping task would mostly go away.
>
Yes, I actually did similar thing locally, i.e. releasing write lock and taking
read lock before zapping.
But yes, I also think it's cheesier as the caller of the write lock knows nothing
about its write lock was replaced with read lock.


> > > If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> > > post it as a standalone patch. At a glance it doesn't have any dependencies on the
> > > MTRR changes, and I don't want this type of changed buried at the end of a series
> > > that is for a fairly niche setup. This needs a lot of scrutiny to make sure zapping
> > > under read really is safe
> > Given the pre-check patch should work, do you think it's still worthwhile to do
> > this convertion?
>
> I do think it would be a net positive, though I don't know that it's worth your
> time without a concrete use cases. My gut instinct could be wrong, so I wouldn't
> want to take on the risk of running with mmu_lock held for read without hard
> performance numbers to justify the change.
Ok, I see. May try conversion later if I found out the performance justification.

Thanks!