Re: [PATCH v4 12/12] KVM: x86/mmu: convert kvm_zap_gfn_range() to use shared mmu_lock in TDP MMU

From: Yan Zhao
Date: Mon Sep 04 2023 - 03:59:47 EST


On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Yan Zhao wrote:
> > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding for
> > read in TDP MMU and allow zapping of non-leaf SPTEs of level <= 1G.
> > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic() guarded
> > by RCU lock.
> >
> > GFN zap can be super slow if mmu_lock is held for write when there are
> > contentions. In worst cases, huge cpu cycles are spent on yielding GFN by
> > GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" are entered
> > for every GFN.
> > Contentions can either from concurrent zaps holding mmu_lock for write or
> > from tdp_mmu_map() holding mmu_lock for read.
>
> The lock contention should go away with a pre-check[*], correct? That's a more
Yes, I think so, though I don't have time to verify it yet.

> complete solution too, in that it also avoids lock contention for the shadow MMU,
> which presumably suffers the same problem (I don't see anything that would prevent
> it from yielding).
>
> If we do want to zap with mmu_lock held for read, I think we should convert
> kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> tasks will race to install SPTEs that are supposed to be zapped.
Yes. I did't do that to the unmap path was only because I don't want to make a
big code change.
The write lock in kvm_unmap_gfn_range() path is taken in arch-agnostic code,
which is not easy to change, right?

>
> If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> post it as a standalone patch. At a glance it doesn't have any dependencies on the
> MTRR changes, and I don't want this type of changed buried at the end of a series
> that is for a fairly niche setup. This needs a lot of scrutiny to make sure zapping
> under read really is safe
Given the pre-check patch should work, do you think it's still worthwhile to do
this convertion?

>
> [*] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx
>
> > After converting to hold mmu_lock for read, there will be less contentions
> > detected and retaking mmu_lock for read is also faster. There's no need to
> > flush TLB before dropping mmu_lock when there're contentions as SPTEs have
> > been zapped atomically and TLBs are flushed/flush requested immediately
> > within RCU lock.
> > In order to reduce TLB flush count, non-leaf SPTEs not greater than 1G
> > level are allowed to be zapped if their ranges are fully covered in the
> > gfn zap range.
> >
> > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> > ---