Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure

From: Huang, Kai
Date: Thu Jan 26 2023 - 05:55:09 EST


On Mon, 2023-01-23 at 17:41 +0000, Sean Christopherson wrote:
> On Mon, Jan 23, 2023, Huang, Kai wrote:
> > >
> > >
> > > Intel folks,
> > >
> > > Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
> > > code? Is there something I'm forgetting/missing, or is it possible we can go
> > > with a simpler implementation?
> >
> > It's documented in the "TDX TDP MMU design doc" patch:
> >
> > +TDX concurrent populating
> > +-------------------------
> > ......
> > +
> > +Without freezing the entry, the following race can happen. Suppose two vcpus
> > +are faulting on the same GPA and the 2M and 4K level entries aren't populated
> > +yet.
> > +
> > +* vcpu 1: update 2M level EPT entry
> > +* vcpu 2: update 4K level EPT entry
> > +* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
> > +* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
>
> Ooh, the problem isn't that two SEAMCALLs to the same entry get out of order, it's
> that SEAMCALLs operating on child entries can race ahead of the parent. Hrm.
>
> TDX also has the annoying-but-understandable restriction that leafs need to be
> removed before parents. A not-yet-relevant complication on that front is that the
> TDP MMU's behavior of recursively removing children means we also have to worry
> about PRESENT => !PRESENT transitions, e.g. zapping a child because the parent is
> being removed can race with a different vCPU try to populate the child (because
> the vCPU handling a page fault could have seen the PRESENT parent).

Yes.

>
> I think there's an opportunity and motivation to improve the TDP MMU as a whole on
> this front though. Rather than recursively zap children in handle_removed_pt(),
> we can use the RCU callback to queue the page table for removal. Setting the parent
> (target page table) !PRESENT and flushing the TLBs ensures that all children are
> unreachable, i.e. KVM doesn't need to immediately set children !PRESENT. Unlike
> the shadow MMU, which maintains a hash table of shadow pages, once a parent page
> table is removed from the TDP MMU, its children are unreachabled.

Do you mean something like (pseudo):

rcu_callback(&removed_sp->rcu_head, handle_removed_pt);

>
> The RCU callback must run in near-constant time, but that's easy to solve as we
> already have a workqueue for zapping page tables, i.e. the RCU callback can simply
> add the target page to the zap workqueue. That would also allow for a (very minor)
> simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't needed to zap in
> two passes since zapping children of the top-level SPTEs would be deferred to the
> workqueue.

Do you mean zapping the entire page table (from root) doesn't need to be in RCU
read-critical section, but can/should be done after grace period? I think this
makes sense since zapping entire root must happen when root is already invalid,
which cannot be used anymore when the new faults come in?

>
> Back to TDX, to play nice with the restriction that parents are removed only after
> children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
> !PRESENT. That will effectively prune the S-EPT entry and all its children, and
> the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
> before KVM actually tries to zap the children.

Reading the spec, it seems TDH.MEM.RANGE.BLOCK only sets the Secure EPT entry
which points to the entire range as "blocked", but won't go down until leaf to
mark all EPT entries as "blocked", which makes sense anyway.

But it seems TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE both only checks
whether that target EPT entry is "blocked", but doesn't check whether any parent
has been marked as "blocked". Not sure whether this will be a problem. But
anyway if this is a problem, we perhaps can get TDX module to fix.

>
> And if we rework zapping page tables, I suspect we can also address David's concern
> (and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
> necessary only for S-EPT (freezing SPTEs before populating them). Rather than update
> S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
> hook before try_cmpxchg64() (or maybe instead of?). That way KVM TDX can freeze the
> to-be-installed SPTE without common TDP MMU needing to be aware of the change.

I don't quite understand how putting SEAMCALL before the try_cmpxchg64() can
work. Let's say one thread is populating a mapping and another is zapping it.
The populating thread makes SEAMCALL successfully but then try_cmpxchg64()
fails, in this case how to proceed?

>
> > ( I guess such material will be more useful in the comment. And perhaps we can
> > get rid of the "TDX TDP MMU design doc" patch in this series at least for now as
> > probably nobody will look at it :-) )
>
> Please keep the design doc, I'll definitely read it. I'm just chasing too many
> things at the moment and haven't given the TDX series a proper review, i.e. haven't
> even glanced through all of the patches or even the shortlog.