Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure

From: Sean Christopherson
Date: Thu Jan 26 2023 - 12:28:48 EST


On Thu, Jan 26, 2023, Huang, Kai wrote:
> On Mon, 2023-01-23 at 17:41 +0000, Sean Christopherson wrote:
> > I think there's an opportunity and motivation to improve the TDP MMU as a whole on
> > this front though. Rather than recursively zap children in handle_removed_pt(),
> > we can use the RCU callback to queue the page table for removal. Setting the parent
> > (target page table) !PRESENT and flushing the TLBs ensures that all children are
> > unreachable, i.e. KVM doesn't need to immediately set children !PRESENT. Unlike
> > the shadow MMU, which maintains a hash table of shadow pages, once a parent page
> > table is removed from the TDP MMU, its children are unreachabled.
>
> Do you mean something like (pseudo):
>
> rcu_callback(&removed_sp->rcu_head, handle_removed_pt);

Yep.

> > The RCU callback must run in near-constant time, but that's easy to solve as we
> > already have a workqueue for zapping page tables, i.e. the RCU callback can simply
> > add the target page to the zap workqueue. That would also allow for a (very minor)
> > simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't needed to zap in
> > two passes since zapping children of the top-level SPTEs would be deferred to the
> > workqueue.
>
> Do you mean zapping the entire page table (from root) doesn't need to be in RCU
> read-critical section, but can/should be done after grace period? I think this
> makes sense since zapping entire root must happen when root is already invalid,
> which cannot be used anymore when the new faults come in?

Yes, minus the "from root" restriction. When a page table (call it "branch" to
continue the analogy) PTE has been zapped/blocked ("pruned"), KVM just needs to
wait for all potential readers to go away. That guarantee is provided by RCU;
software walkers, i.e. KVM itself, are required to hold RCU, and hardware walkers,
i.e. vCPUs running in the guest, are protected by proxy as the zapper ("arborist"?)
is required to hold RCU until all running vCPUs have been kicked.

In other words, once the PTE is zapped/blocked (branch is pruned), it's completely
removed from the paging tree and no other tasks can access the branch (page table
and its children). I.e. the only remaining reference to the branch is the pointer
handed to the RCU callback. That means the RCU callback has exclusive access to the
branch, i.e. can operate as if it were holding mmu_lock for write. Furthermore, the
RCU callback also doesn't need to flush TLBs because that was again done when
pruning the branch.

It's the same idea that KVM already uses for root SPs, the only difference is how
KVM determines that there is exactly one entity that holds a reference to the SP.

> > Back to TDX, to play nice with the restriction that parents are removed only after
> > children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
> > !PRESENT. That will effectively prune the S-EPT entry and all its children, and
> > the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
> > before KVM actually tries to zap the children.
>
> Reading the spec, it seems TDH.MEM.RANGE.BLOCK only sets the Secure EPT entry
> which points to the entire range as "blocked", but won't go down until leaf to
> mark all EPT entries as "blocked", which makes sense anyway.
>
> But it seems TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE both only checks
> whether that target EPT entry is "blocked", but doesn't check whether any parent
> has been marked as "blocked". Not sure whether this will be a problem. But
> anyway if this is a problem, we perhaps can get TDX module to fix.

Oh, I didn't mean to suggest KVM skip TDH.MEM.RANGE.BLOCK for children, I simply
forgot that all S-EPT entries need to be blocked before they can be removed.

> > And if we rework zapping page tables, I suspect we can also address David's concern
> > (and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
> > necessary only for S-EPT (freezing SPTEs before populating them). Rather than update
> > S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
> > hook before try_cmpxchg64() (or maybe instead of?). That way KVM TDX can freeze the
> > to-be-installed SPTE without common TDP MMU needing to be aware of the change.
>
> I don't quite understand how putting SEAMCALL before the try_cmpxchg64() can
> work. Let's say one thread is populating a mapping and another is zapping it.
> The populating thread makes SEAMCALL successfully but then try_cmpxchg64()
> fails, in this case how to proceed?

Ah, sorry, that was unclear. By "invoke the KVM TDX hook" I didn't mean "do the
SEAMCALL", I meant KVM TDX could do its own manipulation of the KVM-managed SPTEs
before the common/standard flow. E.g. something like:

if (kvm_x86_ops.set_private_spte && private)
r = static_call(kvm_x86_set_private_spte(...)
else
r = try_cmpxchg64(...) ? 0 : -EBUSY;

so that the common code doesn't need to do, or even be aware of, the freezing.
Then I think we just need another hook in handle_removed_pt(), or maybe in what
is currently __kvm_tdp_mmu_write_spte()?

I.e. fully replace the "write" operations in the TDP MMU instead of trying to
smush S-EPT's requirements into the common path.