Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

From: Edgecombe, Rick P
Date: Mon Jun 16 2025 - 20:25:53 EST


On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > A few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.

A TDX module bug could hypothetically cause many types of host instability. We
should think a little more about the context of that risk before we make TDX a
special case or add much error handling code around it. If we end up with a
bunch of paranoid error handling code around TDX module behavior, it is going
to be a pain to maintain. And error handling code for rare cases will be hard
to remove once it's in.

We had a history of unreliable page removal during development of the base
series. When we solved the problem, the solution was not completely clean
(though the issues were more on the guest-affecting side). So I think there is
reason to be concerned. But this should work reliably in theory, so I'm not
sure we should use the error case as a hard justification. Instead, maybe we
should focus on how to make an error less likely in the first place. Unless
there is a specific case you are considering, Yan?

That said, I think the refcounting on error (or rather, notifying guestmemfd
of the error to let it handle it how it wants) is a fine solution, as long as
it doesn't take much code (as is the case in Yan's POC).
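
For the archives, the shape of that error path is roughly the below. Just a
sketch: the tdh_mem_page_remove() argument list is from memory, and
kvm_gmem_handle_error() is a placeholder for whatever the guestmemfd
notification ends up being, not an existing API.

	static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
						enum pg_level level, kvm_pfn_t pfn)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
		u64 entry, level_state;
		u64 err;

		err = tdh_mem_page_remove(&kvm_tdx->td, gfn_to_gpa(gfn),
					  pg_level_to_tdx_sept_level(level),
					  &entry, &level_state);
		if (err) {
			/*
			 * The page is still mapped in the SEPT. Pin the folio
			 * so it can't go back to the allocator, and let
			 * guestmemfd decide what to do with the range.
			 */
			folio_get(pfn_folio(pfn));
			kvm_gmem_handle_error(kvm, gfn);	/* placeholder */
			return -EIO;
		}

		return 0;
	}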

>
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
>
> do_exit
>   |->exit_files
>      |->kvm_gmem_release ==> (1) Unmap guest pages
>      |->release kvmfd
>         |->kvm_destroy_vm ==> (2) Reclaim resources
>            |->kvm_arch_pre_destroy_vm ==> Release hkid
>            |->kvm_arch_destroy_vm ==> Reclaim SEPT page table pages
>
> Without holding a page reference after (1) fails, the guest pages may have
> been re-assigned by the host OS while they are still tracked in the TDX
> module.
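
To spell out the window here: once (1) drops gmem's references, the page can
be reallocated and written by a new owner while the hkid released in (2) is
still live. Roughly (function names approximate):

	folio_put(folio);		/* (1) last ref gone, page reallocatable */
	/* ... a new owner can now write to the page ... */
	tdx_mmu_release_hkid(kvm);	/* (2) but it was still TD private memory */

Anything that keeps the folio refcount elevated across that window closes it.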
>
>
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> With TDX holding the page ref count, the impact of an unmapping failure of
> guest pages is just to leak those pages.

If the kernel might be able to continue working, it should try. It should warn
if there is a risk, so people can use panic_on_warn if they want to stop the
kernel.
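
In code terms, the existing KVM_BUG_ON() pattern already gives us that on the
SEAMCALL error paths, something like (sketch):

	if (KVM_BUG_ON(err, kvm)) {
		/*
		 * KVM_BUG_ON() WARNs once and marks the VM bugged, so
		 * panic_on_warn users get their stop, while everyone else
		 * continues with the page leaked rather than freed.
		 */
		return -EIO;
	}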

>
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping
> failure until after reclaiming is successful. Well, reclaiming itself can
> fail too.

We could ask TDX module folks if there is anything they could guarantee.
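
For example, whether a failed TDH.PHYMEM.PAGE.RECLAIM is permanent or can
succeed on retry. On the KVM side the POC direction would then be to only
drop the ref taken at unmap-failure time once the TDX module has actually
given the page back. A sketch, with tdx_reclaim_leaked_page() as a
hypothetical helper and error handling simplified:

	static void tdx_reclaim_leaked_page(struct page *page)
	{
		u64 rcx, rdx, r8;

		/* Still owned by the TDX module: keep the ref, i.e. leak it */
		if (tdh_phymem_page_reclaim(page, &rcx, &rdx, &r8))
			return;

		tdx_clear_page(page);
		folio_put(page_folio(page));	/* now safe to reallocate */
	}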