Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

From: Gavin Shan
Date: Tue Jun 13 2023 - 20:02:18 EST


Hi Sean,

On 6/14/23 12:58 AM, Sean Christopherson wrote:
On Tue, Jun 13, 2023, Gavin Shan wrote:
On 6/13/23 12:41 AM, Sean Christopherson wrote:
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")

This Fixes isn't correct. That change only affected x86, which doesn't have this
bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots

...

Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").


The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") from
the code. From the 'git-bisect', we found the issue disappears when the head is
commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
MMU notifier callbacks").

No, just because bisect points at a commit doesn't mean that's the commit that
introduced a bug. It's not uncommon for a completely valid change to expose a
pre-existing bug, which is what I suspect happened here, e.g. after switching to
the generic framework, clean_dcache_guest_page() is called *after* retrieving the
memslot, versus clean_dcache_guest_page() being called before walking memslots.

Blaming the correct commit matters, both so that it's clear to future readers what
went wrong, and also so that people maintaining older kernels at least are aware
that there kernel may be affected. E.g. a fix in generic KVM obviously won't
backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
could manually port the fix to KVM arm64 code.


Yes, the change of the call site on clean_dcache_guest_page() enlarges the race
condition window between memory slot removal and change_pte() theoretically.
With this, we were able to reproduce the issue in a practical test. In next
revision, lets put a note about this in the changelog, like below.

We tried a git-bisect and the first problematic commit is cd4c71835228 ("
KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(). Without this commit, clean_dcache_guest_page()
is called before the iteration on the memory slots. This change literally
enlarges the racy window between kvm_mmu_notifier_change_pte() and memory
slot removal so that we're able to reproduce the issue in a practical test
case. However, the issue exists since d5d8184d35c9 ("KVM: ARM: Memory
virtualization setup").
Cc: stable@xxxxxxxxxxxxxxx # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

I'm declined to fix the issue only for ARM64,

I never suggested that an ARM-only fix be made, in fact I explicitly suggested
the exact opposite.


Well, the original patch ignores the invalid memory slot for all MMU notifiers.
You suggested to have this ignorance only for change_pte(). The scope is decreased
and I was following this direction to narrow this issue/fix to change_pte() on
ARM64 only. Sorry that I misunderstood your suggestion.

more details are given below. If we're going to limit the issue to ARM64 and
fix it for ARM64 only, the fixes tag should be the one as you pointed. Lets
correct it in next revision with:

For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
and MIPS commits, which at a glance are the affected architectures. At one point
in time, x86 was also likely affected, but that's probably not worth more than a
brief mention in the changelog as I don't see any value in tracking down a very
theoretical window of time where x86 was affected.


Right. Please refer to above reply. I plan to add a section in the changelog
in next revision.

Cc: stable@xxxxxxxxxxxxxxx # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

Reported-by: Shuai Hu <hshuai@xxxxxxxxxx>
Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+ if (slot->flags & KVM_MEMSLOT_INVALID)
+ continue;

Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.

I.e.

kvm_copy_memslot(invalid_slot, old);
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
kvm_replace_memslot(kvm, old, invalid_slot);

/*
* Activate the slot that is now marked INVALID, but don't propagate
* the slot to the now inactive slots. The slot is either going to be
* deleted or recreated as a new slot.
*/
kvm_swap_active_memslots(kvm, old->as_id);


==> invalidate_range_start()

/*
* From this point no new shadow pages pointing to a deleted, or moved,
* memslot will be created. Validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, old);

And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
unmap the prior SPTE. Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details). That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.


Right, those architectural dependencies are really something I worried about.

The zap case isn't a architecture specific, that's true for all KVM architectures.


Yes.

It's safe to skip the invalid memory slots for ARM64,

No, it's not. The problematic window where an invalidation can be incorrectly
skipped is very tiny, and would have zero chance of being seen without something
generating invalidations, e.g. page migration. But that doesn't mean there's no
bug.


Yeah, Agree.

but it may be unsafe to do so for other architectures. You've listed the
potential risks to do so for x86. It might be risky with PowerPC's reverse
mapping stuff either. I didn't look into the code for the left architectures.
In order to escape from the architectural conflicts, I would move the check
and skip the invalid memory slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(),
something like below. In this way, the guest hang issue in ARM64 guest is
fixed. We may have similar issue on other architectures, but we can figure
out the fix when we have to. Sean, please let me know if you're happy with
this?

arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
----------------------------------------

bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_pfn_t pfn = pte_pfn(range->pte);

if (!kvm->arch.mmu.pgt)
return false;

/*
* The memory slot can become invalid temporarily or permanently
* when it's being moved or deleted. Avoid the stage2 mapping so
* that all the read and write requests to the region of the memory
* slot can be forwarded to VMM and emulated there.
*/
if (range->slot->flags & KVM_MEMSLOT_INVALID)
return false;

Please no. (a) it papers over common KVM's reliance on the SPTE being zapped by
invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).


Ok. Now I understand your suggestion to fix the issue for architectures. I will post
v3 with your suggested changes.

WARN_ON(range->end - range->start != 1);

:
}

I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation. E.g.


It's good idea, but I'm afraid other architectures like PowerPC will still be
affected.

I don't follow. PPC unmaps in response to a PTE change, but that's effectively
dead code due to change_pte() being wrapped with invalidate_range_{start,end}().


You're right. I forgot the fact that change_pte() is surrounded by invalidate_{start,end}().

So I would like to limit this issue to ARM64 and fix it in its
kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
could you please share more information about this? I'm blindly guessing
you're saying KSM isn't used in google cloud?

Yeah, we == Google/GCE. Sorry, should have clarified that.


Ok, thanks for your clarification.

Thanks,
Gavin