Re: [PATCH 3/5] KVM: gmem: Hold filemap invalidate lock while allocating/preparing folios
From: Yan Zhao
Date: Thu Jun 12 2025 - 08:43:47 EST
On Tue, Jun 03, 2025 at 11:28:35PM -0700, Vishal Annapurve wrote:
> On Mon, Jun 2, 2025 at 6:34 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> >
> > On Mon, Jun 02, 2025 at 06:05:32PM -0700, Vishal Annapurve wrote:
> > > On Tue, May 20, 2025 at 11:49 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, May 19, 2025 at 10:04:45AM -0700, Ackerley Tng wrote:
> > > > > Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:
> > > > >
> > > > > > Yan Zhao <yan.y.zhao@xxxxxxxxx> writes:
> > > > > >
> > > > > >> On Fri, Mar 14, 2025 at 05:20:21PM +0800, Yan Zhao wrote:
> > > > > >>> This patch would cause a host deadlock when booting up a TDX VM even if huge
> > > > > >>> pages are turned off. I have reverted this patch for now; no further debugging yet.
> > > > > >> This is because kvm_gmem_populate() takes filemap invalidation lock, and for
> > > > > >> TDX, kvm_gmem_populate() further invokes kvm_gmem_get_pfn(), causing deadlock.
> > > > > >>
> > > > > >> kvm_gmem_populate
> > > > > >>   filemap_invalidate_lock
> > > > > >>   post_populate
> > > > > >>     tdx_gmem_post_populate
> > > > > >>       kvm_tdp_map_page
> > > > > >>         kvm_mmu_do_page_fault
> > > > > >>           kvm_tdp_page_fault
> > > > > >>             kvm_tdp_mmu_page_fault
> > > > > >>               kvm_mmu_faultin_pfn
> > > > > >>                 __kvm_mmu_faultin_pfn
> > > > > >>                   kvm_mmu_faultin_pfn_private
> > > > > >>                     kvm_gmem_get_pfn
> > > > > >>                       filemap_invalidate_lock_shared
> > > > > >>
> > > > > >> Though kvm_gmem_populate() could take the shared filemap invalidation lock
> > > > > >> instead (avoiding the deadlock), lockdep would still warn "Possible unsafe
> > > > > >> locking scenario: ...DEADLOCK" about the recursive shared lock, since commit
> > > > > >> e918188611f0 ("locking: More accurate annotations for read_lock()").
> > > > > >>
> > > > > >
> > > > > > Thank you for investigating. This should be fixed in the next revision.
> > > > > >
> > > > >
> > > > > This was not fixed in v2 [1]; I had misunderstood this locking issue.
> > > > >
> > > > > IIUC kvm_gmem_populate() gets a pfn via __kvm_gmem_get_pfn(), then calls
> > > > > part of the KVM fault handler to map the pfn into secure EPTs, then
> > > > > calls the TDX module for the copy+encrypt.
> > > > >
> > > > > Regarding this lock, it seems KVM's MMU lock is already held while TDX
> > > > > does the copy+encrypt. Why must the filemap_invalidate_lock() also be
> > > > > held throughout the process?
> > > > If kvm_gmem_populate() does not hold the filemap invalidate lock around all
> > > > requested pages, what value should it return after kvm_gmem_punch_hole() zaps a
> > > > mapping it just successfully installed?
> > > >
> > > > TDX currently takes kvm->mmu_lock for read in tdx_gmem_post_populate() only
> > > > when CONFIG_KVM_PROVE_MMU is enabled, since both slots_lock and the filemap
> > > > invalidate lock are already taken in kvm_gmem_populate().
> > >
> > > Does TDX need the kvm_gmem_populate path just to ensure SEPT ranges are
> > > not zapped during tdh_mem_page_add and tdh_mr_extend operations? Would
> > > holding the KVM MMU read lock during these operations be sufficient to
> > > avoid this back and forth between the TDX and gmem layers?
> > I think the problem here is that in kvm_gmem_populate(),
> > "__kvm_gmem_get_pfn(), post_populate(), and kvm_gmem_mark_prepared()"
> > must be wrapped in the filemap invalidate lock (shared or exclusive), right?
> >
> > Then, in TDX's post_populate() callback, the filemap invalidate lock is held
> > again by kvm_tdp_map_page() --> ... ->kvm_gmem_get_pfn().
>
> I am contesting the need for the kvm_gmem_populate path altogether for TDX.
> Can you help me understand what problem the kvm_gmem_populate path
> solves for TDX?
There is a long discussion on the list about this.
Basically, TDX needs 3 steps for KVM_TDX_INIT_MEM_REGION:
1. Get the PFN.
2. Map the mirror page table.
3. Invoke tdh_mem_page_add().
Holding the filemap invalidation lock around these 3 steps helps ensure that
the PFN passed to tdh_mem_page_add() remains valid; see the sketch below.
Rather than revisit it, what about fixing the contention more simply, as
below? Otherwise we can revisit the history.
(The code is based on Ackerley's branch
https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2, with patch "HACK: filemap_invalidate_lock() only for getting the pfn" reverted).
commit d71956718d061926e5d91e5ecf60b58a0c3b2bad
Author: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Date: Wed Jun 11 18:17:26 2025 +0800
KVM: guest_memfd: Use shared filemap invalidate lock in kvm_gmem_populate()
Convert kvm_gmem_populate() to use the shared filemap invalidate lock. This
avoids a deadlock caused by kvm_gmem_populate() invoking
tdx_gmem_post_populate(), which internally acquires the shared filemap
invalidate lock in kvm_gmem_get_pfn().
To avoid a lockdep warning about the nested shared filemap invalidate lock,
skip taking the shared filemap invalidate lock in kvm_gmem_get_pfn() when
lockdep reports it is already held.
Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 784fc1834c04..ccbb7ceb978a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -2393,12 +2393,16 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	struct file *file = kvm_gmem_get_file(slot);
 	struct folio *folio;
 	bool is_prepared = false;
+	bool get_shared_lock;
 	int r = 0;
 
 	if (!file)
 		return -EFAULT;
 
-	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+	get_shared_lock = !IS_ENABLED(CONFIG_LOCKDEP) ||
+			  !lockdep_is_held(&file_inode(file)->i_mapping->invalidate_lock);
+	if (get_shared_lock)
+		filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
 
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
 	if (IS_ERR(folio)) {
@@ -2423,7 +2427,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 out:
-	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
+	if (get_shared_lock)
+		filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	fput(file);
 	return r;
 }
@@ -2536,7 +2541,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 	if (!file)
 		return -EFAULT;
 
-	filemap_invalidate_lock(file->f_mapping);
+	filemap_invalidate_lock_shared(file->f_mapping);
 
 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
 	for (i = 0; i < npages; i += npages_to_populate) {
@@ -2587,7 +2592,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		break;
 	}
 
-	filemap_invalidate_unlock(file->f_mapping);
+	filemap_invalidate_unlock_shared(file->f_mapping);
 	fput(file);
 
 	return ret && !i ? ret : i;
If that looks good to you, then for the in-place conversion version of
guest_memfd there's one remaining issue: an AB-BA lock ordering between the
shared filemap invalidate lock and mm->mmap_lock, i.e.,
- in the kvm_gmem_fault_shared() path,
  the lock sequence is mm->mmap_lock --> filemap_invalidate_lock_shared(),
- while in the kvm_gmem_populate() path,
  the lock sequence is filemap_invalidate_lock_shared() --> mm->mmap_lock.
We can fix it with the patch below. The downside of this patch is that it
requires userspace to initialize all source pages passed to TDX, which I'm not
sure everyone will like. If it cannot land, we still have another option:
disallow the initial memory regions being backed by the in-place conversion
version of guest_memfd. If that can be enforced, we can resolve the issue with
a lockdep annotation indicating that kvm_gmem_fault_shared() and
kvm_gmem_populate() cannot occur on the same guest_memfd, so the shared
filemap invalidate locks taken in the two paths are never the same lock.
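A rough sketch of that lockdep annotation, assuming a guest_memfd that can
never be faulted shared could be given its own lock class at creation time
(all names below are illustrative only, not existing code):

/* Hypothetical: give a populate-only guest_memfd a distinct lockdep
 * class for its invalidate_lock, so lockdep no longer treats the lock
 * taken in kvm_gmem_populate() and the one taken in
 * kvm_gmem_fault_shared() as the same lock.
 */
static struct lock_class_key gmem_populate_only_invalidate_key;

static void kvm_gmem_mark_populate_only(struct inode *inode)
{
	lockdep_set_class(&inode->i_mapping->invalidate_lock,
			  &gmem_populate_only_invalidate_key);
}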
Author: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Date: Wed Jun 11 18:23:00 2025 +0800
KVM: TDX: Use get_user_pages_fast_only() in tdx_gmem_post_populate()
Convert get_user_pages_fast() to get_user_pages_fast_only()
in tdx_gmem_post_populate().
Unlike get_user_pages_fast(), which will acquire mm->mmap_lock and fault in
physical pages if it finds the pages have not yet been faulted in or have
been zapped/swapped out, get_user_pages_fast_only() returns directly in
such cases.
Using get_user_pages_fast_only() keeps tdx_gmem_post_populate() from
acquiring mm->mmap_lock, which may cause an AB-BA lockdep warning against the
shared filemap invalidate lock once guest_memfd in-place conversion is
supported. (In the kvm_gmem_fault_shared() path, the lock sequence is
mm->mmap_lock --> filemap_invalidate_lock_shared(), while in the
kvm_gmem_populate() path it is filemap_invalidate_lock_shared() -->
mm->mmap_lock.)
Besides, using get_user_pages_fast_only() and returning directly to
userspace if a page is not present in the primary PTEs helps detect the
careless case where userspace has not initialized the source pages.
As the initial memory region bypasses guest acceptance, copying an
uninitialized source page into the guest could be harmful and would undermine
the page measurement.
Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 93c31eecfc60..462390dddf88 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3190,9 +3190,10 @@ static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
 	 */
-	ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
+	ret = get_user_pages_fast_only((unsigned long)src, 1, 0, &src_page);
 	if (ret < 0)
 		return ret;
+
 	if (ret != 1)
 		return -ENOMEM;
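On the userspace side, this means every source page must actually be written
(not just mapped) before calling KVM_TDX_INIT_MEM_REGION, so that
get_user_pages_fast_only() finds it present in the primary PTEs. A minimal,
illustrative-only sketch:

#include <string.h>
#include <sys/mman.h>

/* Illustrative only: allocate and fill the source buffer; the copy
 * writes every page, faulting it into the primary page tables. */
static void *prepare_source(const void *image, size_t len)
{
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (src == MAP_FAILED)
		return NULL;
	memcpy(src, image, len);
	return src;
}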