On Tue, Jun 13, 2023, Gavin Shan wrote:
On 6/13/23 12:41 AM, Sean Christopherson wrote:
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
This Fixes isn't correct. That change only affected x86, which doesn't have this
bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots
...
Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
The fixes tag was determined by 'git-bisect', not static code analysis. From
the code, I agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization
setup"). From the 'git-bisect', we found the issue disappears when HEAD is
commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
MMU notifier callbacks").
No, just because bisect points at a commit doesn't mean that's the commit that
introduced a bug. It's not uncommon for a completely valid change to expose a
pre-existing bug, which is what I suspect happened here, e.g. after switching to
the generic framework, clean_dcache_guest_page() is called *after* retrieving the
memslot, versus clean_dcache_guest_page() being called before walking memslots.
Blaming the correct commit matters, both so that it's clear to future readers what
went wrong, and also so that people maintaining older kernels at least are aware
that their kernel may be affected. E.g. a fix in generic KVM obviously won't
backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
could manually port the fix to KVM arm64 code.
I'm inclined to fix the issue only for ARM64,
I never suggested that an ARM-only fix be made, in fact I explicitly suggested
the exact opposite.
more details are given below. If we're going to limit the issue to ARM64 and
fix it for ARM64 only, the fixes tag should be the one you pointed out. Let's
correct it in the next revision with:
For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
and MIPS commits, which at a glance are the affected architectures. At one point
in time, x86 was also likely affected, but that's probably not worth more than a
brief mention in the changelog as I don't see any value in tracking down a very
theoretical window of time where x86 was affected.
Cc: stable@xxxxxxxxxxxxxxx # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@xxxxxxxxxx>
Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+ if (slot->flags & KVM_MEMSLOT_INVALID)
+ continue;
Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.
I.e.
kvm_copy_memslot(invalid_slot, old);
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
kvm_replace_memslot(kvm, old, invalid_slot);
/*
* Activate the slot that is now marked INVALID, but don't propagate
* the slot to the now inactive slots. The slot is either going to be
* deleted or recreated as a new slot.
*/
kvm_swap_active_memslots(kvm, old->as_id);
==> invalidate_range_start()
/*
* From this point no new shadow pages pointing to a deleted, or moved,
* memslot will be created. Validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, old);
And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
to unmap the prior SPTE. Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details). That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.
Right, those architectural dependencies are really something I worried about.
The zap case isn't architecture-specific; that's true for all KVM architectures.
It's safe to skip the invalid memory slots for ARM64,
No, it's not. The problematic window where an invalidation can be incorrectly
skipped is very tiny, and would have zero chance of being seen without something
generating invalidations, e.g. page migration. But that doesn't mean there's no
bug.
but it may be unsafe to do so for other architectures. You've listed the
potential risks of doing so for x86. It might be risky with PowerPC's reverse
mapping stuff as well. I haven't looked into the code for the remaining architectures.
To avoid these architectural conflicts, I would move the check and skip the
invalid memory slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(), as shown
below. This fixes the guest hang issue on ARM64. We may have a similar issue
on other architectures, but we can figure out a fix when we have to. Sean,
please let me know if you're happy with this.
arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
----------------------------------------
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_pfn_t pfn = pte_pfn(range->pte);
if (!kvm->arch.mmu.pgt)
return false;
/*
* The memory slot can become invalid temporarily or permanently
* when it's being moved or deleted. Avoid the stage2 mapping so
* that all the read and write requests to the region of the memory
* slot can be forwarded to the VMM and emulated there.
*/
if (range->slot->flags & KVM_MEMSLOT_INVALID)
return false;
Please no. (a) it papers over common KVM's reliance on the SPTE being zapped by
invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).
WARN_ON(range->end - range->start != 1);
:
}
I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation. E.g.
It's a good idea, but I'm afraid other architectures like PowerPC will still be
affected.
I don't follow. PPC unmaps in response to a PTE change, but that's effectively
dead code due to change_pte() being wrapped with invalidate_range_{start,end}().
So I would like to limit this issue to ARM64 and fix it in its
kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
could you please share more information about this? I'm blindly guessing
you're saying KSM isn't used in Google Cloud?
Yeah, we == Google/GCE. Sorry, should have clarified that.