Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware

From: Paolo Bonzini
Date: Wed Aug 03 2022 - 11:08:38 EST


On 8/2/22 21:12, Sean Christopherson wrote:
Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting
the node for the existing shadow page, but it's actually getting the node for the
underlying physical huge page.

That means this patch is broken now that KVM doesn't require a "struct page" to
create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
"struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page()
may return garbage.

return page_to_nid(pfn_to_page(spte_to_pfn(spte)));

I was about to say that yesterday. However my knowledge of struct page things has been proved to be spotty enough, that I wasn't 100% sure of that. But yeah, with a fresh mind it's quite obvious that anything that goes through hva_to_pfn_remap and similar paths will fail.

That said, I still don't like this patch, at least not as a one-off thing. I do
like the idea of having page table allocations follow the node requested for the
target pfn, what I don't like is having one-off behavior that silently overrides
userspace policy.

Yes, I totally agree with making it all or nothing.

I would much rather we solve the problem for all page table allocations, maybe
with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change
because it will probably require having a per-node cache in each of the per-vCPU
caches.

A module parameter is fine. If you care about performance to this level, your userspace is probably homogeneous enough.

Paolo

Hmm, actually, a not-awful alternative would be to have the fault path fallback
to the current task's policy if allocation fails. I.e. make it best effort.

E.g. as a rough sketch...

---
arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debca..e475f5b55794 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -273,10 +273,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,

static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
+ struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = kvm_mmu_alloc_shadow_page_table(cache, GFP_NOWAIT, pfn);

return sp;
}
@@ -1190,7 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_removed_spte(iter.old_spte))
break;

- sp = tdp_mmu_alloc_sp(vcpu);
+ sp = tdp_mmu_alloc_sp(vcpu, fault->pfn);
tdp_mmu_init_child_sp(sp, &iter);

if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
@@ -1402,17 +1403,39 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}

-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+void *kvm_mmu_alloc_shadow_page_table(struct kvm_mmu_memory_cache *cache,
+ gfp_t gfp, kvm_pfn_t pfn)
+{
+ struct page *page = kvm_pfn_to_refcounted_page(pfn);
+ struct page *spt_page;
+ int node;
+
+ gfp |= __GFP_ZERO | __GFP_ACCOUNT;
+
+ if (page) {
+ spt_page = alloc_pages_node(page_to_nid(page), gfp, 0);
+ if (spt_page)
+ return page_address(spt_page);
+ } else if (!cache) {
+ return (void *)__get_free_page(gfp);
+ }
+
+ if (cache)
+ return kvm_mmu_memory_cache_alloc(cache);
+
+ return NULL;
+}
+
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp,
+ kvm_pfn_t pfn)
{
struct kvm_mmu_page *sp;

- gfp |= __GFP_ZERO;
-
sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
if (!sp)
return NULL;

- sp->spt = (void *)__get_free_page(gfp);
+ sp->spt = kvm_mmu_alloc_shadow_page_table(NULL, gfp, pfn);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
@@ -1436,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT);
if (sp)
return sp;


base-commit: f8990bfe1eab91c807ca8fc0d48705e8f986b951
--