Re: [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root

From: Paolo Bonzini
Date: Wed Mar 02 2022 - 09:54:52 EST


On 3/2/22 03:13, Sean Christopherson wrote:
After typing up all of the below, I actually tried the novel idea of compiling
the code... and we can't do xchg() on role.invalid because it occupies a single
bit, it's not a standalone boolean.

Yeah I thought the same right after sending the email, but I'll just say it
was pseudocode. :) We can do

static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
{
union kvm_mmu_page_role role = page->role;
role.invalid = true;

/* No need to use cmpxchg, only the invalid bit can change. */
role.word = xchg(&page->role.word, role.word);
return role.invalid;
}

Either way, barriers or xchg, it needs to be a separate function.

by using refcount_dec_not_one() above, there's no guarantee that this
task is the last one to see it as kvm_tdp_mmu_get_root() can succeed
and bump the refcount between refcount_dec_not_one() and here.
Yep, I agree refcount_dec_and_test is needed.

Based on my own version, I guess you mean (without comments due to family
NMI):

if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;

if (!xchg(&root->role.invalid, true) {
refcount_set(&root->tdp_mmu_root_count, 1);
tdp_mmu_zap_root(kvm, root, shared);
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
}

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);

That would work, though I'd prefer to recurse on kvm_tdp_mmu_put_root() instead
of open coding refcount_dec_and_test() so that we get coverage of the xchg()
doing the right thing.

I still slightly prefer having the "free" path be inside the xchg(). To me, even
though the "free" path is the only one that's guaranteed to be reached for every root,
the fall-through to resetting the refcount and zapping the root is the "normal" path,
and the "free" path is the exception.

Hmm I can see how that makes especially sense once you add in the worker logic.
But it seems to me that the "basic" logic should be "do both the xchg and the
free", and coding the function with tail recursion obfuscates this. Even with
the worker, you grow an

+ if (kvm_get_kvm_safe(kvm)) {
+ ... let the worker do it ...
+ return;
+ }
+
tdp_mmu_zap_root(kvm, root, shared);

but you still have a downwards flow that matches what happens even if multiple
threads pick up different parts of the job.

So, I tried a bunch of alternatives including with gotos and with if/else, but
really the above one remains my favorite.

My plan would be:

1) splice the mini series I'm attaching before this patch, and
remove patch 1 of this series. next_invalidated_root() is a
bit yucky, but notably it doesn't need to add/remove a reference
in kvm_tdp_mmu_zap_invalidated_roots().

Together, these two steps ensure that readers never acquire a
reference to either refcount=0/valid or invalid pages". In other
words, the three states of that kvm_tdp_mmu_put_root moves the root
through (refcount=0/valid -> refcount=0/invalid -> refcount=1/invalid)
are exactly the same to readers, and there are essentially no races
to worry about.

In other other words, it's trading slightly uglier code for simpler
invariants.

2) here, replace the change to kvm_tdp_mmu_put_root with the following:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b6ffa91fb9d7..aa0669f54d96 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,16 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared);
+static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
+{
+ union kvm_mmu_page_role role = page->role;
+ role.invalid = true;
+
+ /* No need to use cmpxchg, only the invalid bit can change. */
+ role.word = xchg(&page->role.word, role.word);
+ return role.invalid;
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -91,20 +101,44 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
WARN_ON(!root->tdp_mmu_page);
+ /*
+ * The root now has refcount=0 and is valid. Readers cannot acquire
+ * a reference to it (they all visit valid roots only, except for
+ * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
+ * any reference itself.
+ *
+ * Even though there are flows that need to visit all roots for
+ * correctness, they all take mmu_lock for write, so they cannot yet
+ * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
+ * since the root still has refcount=0.
+ *
+ * However, tdp_mmu_zap_root can yield, and writers do not expect to
+ * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
+ * So the root temporarily gets an extra reference, going to refcount=1
+ * while staying invalid. Readers still cannot acquire any reference;
+ * but writers are now allowed to run if tdp_mmu_zap_root yields and
+ * they might take an extra reference is they themselves yield. Therefore,
+ * when the reference is given back after tdp_mmu_zap_root terminates,
+ * there is no guarantee that the refcount is still 1. If not, whoever
+ * puts the last reference will free the page, but they will not have to
+ * zap the root because a root cannot go from invalid to valid.
+ */
+ if (!kvm_tdp_root_mark_invalid(root)) {
+ refcount_set(&root->tdp_mmu_root_count, 1);
+ tdp_mmu_zap_root(kvm, root, shared);
+
+ /*
+ * Give back the reference that was added back above. We now
+ * know that the root is invalid, so go ahead and free it if
+ * no one has taken a reference in the meanwhile.
+ */
+ if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
+ return;
+ }
+
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-
- /*
- * A TLB flush is not necessary as KVM performs a local TLB flush when
- * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
- * to a different pCPU. Note, the local TLB flush on reuse also
- * invalidates any paging-structure-cache entries, i.e. TLB entries for
- * intermediate paging structures, that may be zapped, as such entries
- * are associated with the ASID on both VMX and SVM.
- */
- tdp_mmu_zap_root(kvm, root, shared);
-
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

3) for the worker patch, the idea would be

+static void tdp_mmu_zap_root_work(struct work_struct *work)
+{
+ ...
+}
+
+
+static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ root->tdp_mmu_async_data = kvm;
+ INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
+ schedule_work(&root->tdp_mmu_async_work);
+}
+
static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
{
union kvm_mmu_page_role role = page->role;
@@ -125,13 +165,24 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
*/
if (!kvm_tdp_root_mark_invalid(root)) {
refcount_set(&root->tdp_mmu_root_count, 1);
- tdp_mmu_zap_root(kvm, root, shared);
/*
- * Give back the reference that was added back above. We now
+ * If the struct kvm is alive, we might as well zap the root
+ * in a worker. The worker takes ownership of the reference we
+ * have just added to root as well as the new reference to kvm.
+ */
+ if (kvm_get_kvm_safe(kvm)) {
+ tdp_mmu_schedule_zap_root(kvm, root);
+ return;
+ }
+
+ /*
+ * The struct kvm is being destroyed, zap synchronously and give
+ * back immediately the reference that was added above. We now
* know that the root is invalid, so go ahead and free it if
* no one has taken a reference in the meanwhile.
*/
+ tdp_mmu_zap_root(kvm, root, shared);
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
}


Again, I appreciate the idea behind the recursive call, but I think
overall it's clearer to have a clear flow from the beginning to the
end of the function, with the exceptions and optimizations noted as
early returns.

Let me know what you think. Tomorrow I have a day off, but later
today I will have my changes tested and pushed to kvm/queue for you
to look at.

Thanks,

PaoloFrom f54d23ec258da7c631c7b059048fb383b8255079 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Date: Wed, 2 Mar 2022 08:44:22 -0500
Subject: [PATCH 1/2] KVM: x86/mmu: only perform eager page splitting on valid
roots

Eager page splitting is an optimization; it does not have to be performed on
invalid roots. By only operating on the valid roots, this removes the
only case in which a reader might acquire a reference to an invalid root.

Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
---
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 941e25fd14bd..e3b104448e14 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1541,7 +1541,7 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,

kvm_lockdep_assert_mmu_lock_held(kvm, shared);

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
if (r) {
kvm_tdp_mmu_put_root(kvm, root, shared);
--
2.31.1


From 95557483c2a32386cb0769a61f838550d9467434 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Date: Wed, 2 Mar 2022 08:51:05 -0500
Subject: [PATCH 2/2] KVM: x86/mmu: do not allow readers to acquire references
to invalid roots

Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
that readers do not ever acquire a reference to an invalid root. After this
patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
same way. kvm_tdp_mmu_zap_invalidated_roots() is different but it also
does not acquire a reference to the invalid root, and it cannot see
refcount=0/invalid because it is guaranteed to run after
kvm_tdp_mmu_invalidate_all_roots().

Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
---
arch/x86/kvm/mmu/tdp_mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e3b104448e14..2e935edd3f6c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -171,8 +171,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)

-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
- __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, false, false)

/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
@@ -879,7 +879,8 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
{
struct kvm_mmu_page *root;

- for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);

return flush;
@@ -895,8 +896,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
+ lockdep_assert_held_write(&kvm->mmu_lock);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- for_each_tdp_mmu_root_yield_safe(kvm, root, i, false)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, i)
tdp_mmu_zap_root(kvm, root, false);
}
}
--
2.31.1