Re: [PATCH 02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion

From: Huang, Kai
Date: Wed May 15 2024 - 09:25:06 EST


On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> From: Yan Zhao <yan.y.zhao@xxxxxxxxx>
>
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zapping only
> leaf SPTEs when deleting a memslot.
>
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and more importantly, it's due to a bug [1] which caused VM instability
> when a VM is with Nvidia Geforce GPU assigned.
>
> There was a previous attempt [2] to introduce a per-VM flag to work around
> bug [1] by only allowing "zapping only memslot leaf SPTEs" for specific
> VMs. However, [2] was not merged due to the lack of a clear explanation of
> exactly what is broken [3], and because it's not wise to "have a bug that
> is known to happen when you enable the capability".
>
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for the following reasons:
> - TDX requires that the root page of the private page table remain
> unaltered throughout the TD life cycle.
> - TDX mandates that leaf entries in the private page table be zapped prior
> to non-leaf entries.
>
> So, Sean reconsidered introducing a per-VM or a per-memslot flag for VMs
> like TDX [4].
>
> This patch is an implementation of the per-memslot flag.
> Compared to the per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior at a fine
> granularity, optimizations for specific use cases can be developed
> without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions
> caused by changing KVM behavior, as seen previously.
>
> Cons:
> (1) Users need to ensure all necessary memslots have the
> KVM_MEM_ZAP_LEAFS_ONLY flag set, e.g. QEMU needs to ensure every
> GUEST_MEMFD memslot has the ZAP_LEAFS_ONLY flag for a TDX VM.
> (2) Opens up the possibility that userspace could configure memslots for a
> normal VM in such a way that bug [1] is seen.

I don't quite follow the logic of why userspace should be involved.

TDX cannot use the "page table fast zap", and needs a different way to
zap, i.e., zap-leaf-only while holding the MMU write lock, but that doesn't
necessarily mean such a thing should be exposed to userspace?

It's weird that userspace needs to control how KVM zaps page tables on
memslot delete/move.

[2] mentions there is a performance improvement for certain VMs if KVM
does zap-leaf-only, but AFAICT it doesn't provide a concrete argument for
why this needs to be exposed to userspace so it can be done per-VM (that
whole thread was basically about the bug in [1]). E.g., see:

https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#m702b273057cc318465cb5a1677d94e923dce9832

"
Ya, a capability is a bad idea. I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense. But adding something to the ABI on pure speculation is
silly.
"

So to me it looks like overkill to expose this "zap-leaf-only" to
userspace. We can just set this flag internally when a memslot is created
for a TDX guest in KVM.
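
A minimal sketch of what I mean (completely untested; it assumes the
KVM_MEM_ZAP_LEAFS_ONLY bit is kept as a kernel-internal memslot flag
rather than uAPI), e.g. in kvm_arch_prepare_memory_region():

	if (change == KVM_MR_CREATE &&
	    kvm->arch.vm_type == KVM_X86_TDX_VM &&
	    (new->flags & KVM_MEM_GUEST_MEMFD))
		new->flags |= KVM_MEM_ZAP_LEAFS_ONLY;

kvm_arch_flush_shadow_memslot() can then key off the flag exactly as in
this patch, but userspace never needs to know about it.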

>
> However, one thing worth noting for TDX is that TDX may potentially hit
> bug [1] with either the per-memslot flag or the per-VM flag approach,
> since there's a use case on the radar to assign an untrusted & passthrough
> GPU device to a TDX VM. If that happens, it can be treated as a bug (not a
> regression) and fixed accordingly.
>
> An alternative approach we could also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But the
> downside is that this approach requires creating a new, very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
>
> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@xxxxxxxxx [1]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@xxxxxxxxxx [4]
> Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> ---
> TDX MMU Part 1:
> - New patch
> ---
> arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 17 +++++++++++++++++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 5 ++++-
> 4 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> + return;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> + };
> +
> + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> + kvm_mmu_zap_memslot_leafs(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> return -EINVAL;
>
> + /*
> + * Since TDX private pages requires re-accepting after zap,
> + * and TDX private root page should not be zapped, TDX requires
> + * memslots for private memory must have flag
> + * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> + * the deleting memslot will be zapped and SPTEs in other
> + * memslots would not be affected.
> + */
> + if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> + (new->flags & KVM_MEM_GUEST_MEMFD) &&
> + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> + return -EINVAL;
> +
> + /* zap-leafs-only works only when TDP MMU is enabled for now */
> + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> + return -EINVAL;

If this zap-leaf-only is supposed to be generic, I don't see why we would
want to restrict it to the TDP MMU?
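
E.g., a rough sketch (completely untested, just to illustrate): since
kvm_unmap_gfn_range() already dispatches to both the rmap-based shadow MMU
and the TDP MMU, the helper could be written as below, and then both the
KVM_BUG_ON(!tdp_mmu_enabled) and the tdp_mmu_enabled check above could
simply be dropped:

static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm,
				      struct kvm_memory_slot *slot)
{
	struct kvm_gfn_range range = {
		.slot = slot,
		.start = slot->base_gfn,
		.end = slot->base_gfn + slot->npages,
		.may_block = true,
	};

	write_lock(&kvm->mmu_lock);

	/* Zap leaf SPTEs in both the shadow MMU and the TDP MMU. */
	if (kvm_unmap_gfn_range(kvm, &range))
		kvm_flush_remote_tlbs_memslot(kvm, slot);

	write_unlock(&kvm->mmu_lock);
}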