Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure

From: Sean Christopherson
Date: Mon Jan 23 2023 - 12:41:43 EST


On Mon, Jan 23, 2023, Huang, Kai wrote:
> >
> >
> > Intel folks,
> >
> > Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
> > code? Is there something I'm forgetting/missing, or is it possible we can go
> > with a simpler implementation?
>
> It's documented in the "TDX TDP MMU design doc" patch:
>
> +TDX concurrent populating
> +-------------------------
> ......
> +
> +Without freezing the entry, the following race can happen. Suppose two vcpus
> +are faulting on the same GPA and the 2M and 4K level entries aren't populated
> +yet.
> +
> +* vcpu 1: update 2M level EPT entry
> +* vcpu 2: update 4K level EPT entry
> +* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
> +* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry

Ooh, the problem isn't that two SEAMCALLs to the same entry get out of order, it's
that SEAMCALLs operating on child entries can race ahead of the parent. Hrm.

TDX also has the annoying-but-understandable restriction that leafs need to be
removed before parents. A not-yet-relevant complication on that front is that the
TDP MMU's behavior of recursively removing children means we also have to worry
about PRESENT => !PRESENT transitions, e.g. zapping a child because the parent is
being removed can race with a different vCPU try to populate the child (because
the vCPU handling a page fault could have seen the PRESENT parent).

I think there's an opportunity and motivation to improve the TDP MMU as a whole on
this front though. Rather than recursively zap children in handle_removed_pt(),
we can use the RCU callback to queue the page table for removal. Setting the parent
(target page table) !PRESENT and flushing the TLBs ensures that all children are
unreachable, i.e. KVM doesn't need to immediately set children !PRESENT. Unlike
the shadow MMU, which maintains a hash table of shadow pages, once a parent page
table is removed from the TDP MMU, its children are unreachabled.

The RCU callback must run in near-constant time, but that's easy to solve as we
already have a workqueue for zapping page tables, i.e. the RCU callback can simply
add the target page to the zap workqueue. That would also allow for a (very minor)
simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't needed to zap in
two passes since zapping children of the top-level SPTEs would be deferred to the
workqueue.

Back to TDX, to play nice with the restriction that parents are removed only after
children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
!PRESENT. That will effectively prune the S-EPT entry and all its children, and
the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
before KVM actually tries to zap the children.

And if we rework zapping page tables, I suspect we can also address David's concern
(and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
necessary only for S-EPT (freezing SPTEs before populating them). Rather than update
S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
hook before try_cmpxchg64() (or maybe instead of?). That way KVM TDX can freeze the
to-be-installed SPTE without common TDP MMU needing to be aware of the change.

> ( I guess such material will be more useful in the comment. And perhaps we can
> get rid of the "TDX TDP MMU design doc" patch in this series at least for now as
> probably nobody will look at it :-) )

Please keep the design doc, I'll definitely read it. I'm just chasing too many
things at the moment and haven't given the TDX series a proper review, i.e. haven't
even glanced through all of the patches or even the shortlog.