Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

From: Yan Zhao
Date: Tue Jul 01 2025 - 20:15:20 EST


On Tue, Jul 01, 2025 at 11:36:22PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > > Can you explain what you found regarding the write lock need?
> > Here, the write lock protects 2 steps:
> > (1) update lpage_info.
> > (2) try splitting if there's any existing 2MB mapping.
> >
> > The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> > kvm_tdp_mmu_map().
> >
> > kvm_tdp_mmu_map
> >   kvm_mmu_hugepage_adjust
> >     kvm_lpage_info_max_mapping_level
> >
> > If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
> > stale 2MB level even after lpage_info is updated by
> > hugepage_set_guest_inhibit().
> >
> > Therefore, we must perform splitting under the write mmu_lock to ensure there
> > are no 2MB mappings after hugepage_set_guest_inhibit().
> >
> > Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting at
> > fault path could be triggered as KVM MMU finds the goal level is 4KB while an
> > existing 2MB mapping is present.
>
> It could be?
> 1. mmu read lock
> 2. update lpage_info
> 3. mmu write lock upgrade
> 4. demote
> 5. mmu unlock
>
> Then (3) could be skipped in the case of ability to demote under read lock?
>
> I noticed that the other lpage_info updaters took mmu write lock, and I wasn't
> sure why. We shouldn't take a lock that we don't actually need just for safety
> margin or to copy other code.
Using the write mmu_lock here is deliberate.

In the 3 steps,
1. set lpage_info
2. demote if needed
3. go to fault handler

Step 2 requires holding the write mmu_lock before invoking kvm_split_boundary_leafs().
The write mmu_lock may also be dropped and re-acquired inside
kvm_split_boundary_leafs(), e.g. for memory allocation.

If step 1 is done under the read mmu_lock, other vCPUs can still fault in at 2MB
level even after the demote in step 2.
Luckily, TDX doesn't support promotion at present, but we can avoid wading into
this complex situation by holding the write mmu_lock for step 1.
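
Roughly, the ordering I have in mind looks like the sketch below (just a sketch;
hugepage_set_guest_inhibit() and kvm_split_boundary_leafs() are from this series,
while the wrapper name and exact signatures are made up for illustration):

/*
 * Sketch, not the final patch: both the lpage_info update and the
 * split of any existing 2MB leaf are done under the write mmu_lock,
 * so no vCPU can fault in a stale 2MB mapping in between.
 */
static int tdx_inhibit_huge_page(struct kvm *kvm,
                                 struct kvm_memory_slot *slot, gfn_t gfn)
{
        int ret;

        write_lock(&kvm->mmu_lock);

        /* Step 1: mark the 2MB range as guest-inhibited in lpage_info. */
        hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);

        /*
         * Step 2: demote any existing 2MB leaf covering gfn.  mmu_lock
         * may be dropped and re-acquired internally, e.g. for memory
         * allocation.
         */
        ret = kvm_split_boundary_leafs(kvm, slot, gfn);

        write_unlock(&kvm->mmu_lock);

        /*
         * Step 3: the subsequent fault in __vmx_handle_ept_violation()
         * then maps at 4KB level under the read mmu_lock.
         */
        return ret;
}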

> > > For most accept
> > > cases, we could fault in the PTE's on the read lock. And in the future we
> > > could
> >
> > The actual mapping at 4KB level is still with read mmu_lock in
> > __vmx_handle_ept_violation().
> >
> > > have a demote that could work under read lock, as we talked. So
> > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > lock
> > > when needed.
> > Could we leave the "demote under read lock" as a future optimization?
>
> We could add it to the list. If we have a TDX module that supports demote with a
> single SEAMCALL then we don't have the rollback problem. The optimization could
> utilize that. That said, we should focus on the optimizations that make the
> biggest difference to real TDs. Your data suggests this might not be the case
> today.
Ok.

> > > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > > lock?
> > As above, to avoid the other vCPUs reading stale mapping level and splitting
> > under read mmu_lock.
>
> We need mmu write lock for demote, but as long as the order is:
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
>
> Then (3) should have what it needs even if another fault races (1).
See the comment above for why we need to hold the write mmu_lock for step 1.

Besides, since we need the write mmu_lock for step 2 anyway (i.e. we must hold the
write mmu_lock before walking the SPTEs to check whether any existing 2MB mapping
is present), I don't see any performance impact from also holding the write
mmu_lock for step 1.


> > As guest_inhibit is set one-way, we could test it using
> > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > write
> > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > (in my testing, 11 during VM boot).
> >  
> > > But in any case, it seems like we have *a* solution here. It doesn't seem
> > > like
> > > there are any big downsides. Should we close it?
> > I think it's good, as long as Sean doesn't disagree :)
>
> He seemed onboard. Let's close it. We can even discuss lpage_info update locking
> on v2.
Ok. I'll use the write mmu_lock for updating lpage_info in v2 first, keeping the
lock-free guest_inhibit check mentioned above (see the sketch below).
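
For reference, that fast path would look roughly like this (again only a sketch;
hugepage_test_guest_inhibit()/hugepage_set_guest_inhibit() and
kvm_split_boundary_leafs() are from this series, and the exact signatures here
are assumptions):

/*
 * Sketch: guest_inhibit is set one-way, so testing it without
 * mmu_lock is safe -- a racing vCPU at worst takes the write
 * lock and finds the work already done.
 */
if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
        write_lock(&kvm->mmu_lock);
        if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
                hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
                ret = kvm_split_boundary_leafs(kvm, slot, gfn);
        }
        write_unlock(&kvm->mmu_lock);
}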