Re: [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock

From: Huang, Kai
Date: Thu Sep 07 2023 - 19:30:50 EST


On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> though mmu_lock must be held to guarantee correctness, i.e. to avoid
> false negatives. Dropping the requirement that mmu_lock be held will
> allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> contending mmu_lock when the guest is accessing a range that is being
> invalidated by the host.
>
> Contending mmu_lock can have severe negative side effects for x86's TDP
> MMU when running on preemptible kernels, as KVM will yield from the
> zapping task (holds mmu_lock for write) when there is lock contention,
> and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> flush.
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> e.g. due to caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.

Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 and
mmu_invalidate_retry_hva() inlined IIUC the kvm_faultin_pfn() can look like
this:

fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; <-- (1)
smp_rmb();

...
READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
...

if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq) <-- (2)
...

Perhaps stupid question :-) Will compiler even believes both vcpu->kvm-
>mmu_invaludate_seq and fault->mmu_seq are never changed thus eliminates the
code in 1) and 2)? Or all the barriers between are enough to prevent compiler
to do such stupid thing?

>
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> include/linux/kvm_host.h | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7418e881c21c..7314138ba5f4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> unsigned long mmu_seq,
> unsigned long hva)
> {
> - lockdep_assert_held(&kvm->mmu_lock);
> /*
> * If mmu_invalidate_in_progress is non-zero, then the range maintained
> * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> * that might be being invalidated. Note that it may include some false
> * positives, due to shortcuts when handing concurrent invalidations.
> + *
> + * Note the lack of a memory barriers! The caller *must* hold mmu_lock
> + * to avoid false negatives! Holding mmu_lock is not mandatory though,
> + * e.g. to allow pre-checking for an in-progress invalidation to
> + * avoiding contending mmu_lock. Ensure that the in-progress flag and
> + * sequence counter are always read from memory, so that checking for
> + * retry in a loop won't result in an infinite retry loop. Don't force
> + * loads for start+end, as the key to avoiding an infinite retry loops
> + * is observing the 1=>0 transition of in-progress, i.e. getting false
> + * negatives (if mmu_lock isn't held) due to stale start+end values is
> + * acceptable.
> */
> - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> hva >= kvm->mmu_invalidate_range_start &&
> hva < kvm->mmu_invalidate_range_end)
> return 1;
> - if (kvm->mmu_invalidate_seq != mmu_seq)
> +
> + if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
> return 1;
> return 0;
> }

I am not sure how mmu_invalidate_retry_hva() can be called in a loop so looks
all those should be theoretical thing, but the extra cost should be nearly empty
as you said.

Acked-by: Kai Huang <kai.huang@xxxxxxxxx>

(Or perhaps patch 1/2 can be just merged to one patch)