Re: [PATCH 20/24] KVM: x86/mmu: Use a dedicated bit to track shadow/MMU-present SPTEs

From: Sean Christopherson
Date: Mon Mar 08 2021 - 15:12:39 EST


On Mon, Mar 08, 2021, Tom Lendacky wrote:
> On 2/25/21 2:47 PM, Sean Christopherson wrote:
> > Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
> > the MMU's perspective. Checking for shadow-present SPTEs is a very
> > common operation for the MMU, particularly in hot paths such as page
> > faults. With the addition of "removed" SPTEs for the TDP MMU,
> > identifying shadow-present SPTEs is quite costly especially since it
> > requires checking multiple 64-bit values.
> >
> > On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
> > On 32-bit KVM, this increases the footprint by ~200 bytes, but only
> > because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().
> >
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > ---
> > arch/x86/kvm/mmu/spte.c | 8 ++++----
> > arch/x86/kvm/mmu/spte.h | 11 ++++++++++-
> > 2 files changed, 14 insertions(+), 5 deletions(-)
>
> I'm trying to run a guest on my Rome system using the queue branch, but
> I'm encountering an error that I bisected to this commit. In the guest
> (during OVMF boot) I see:
>
> error: kvm run failed Invalid argument
> RAX=0000000000000000 RBX=00000000ffc12792 RCX=000000007f58401a RDX=000000007faaf808
> RSI=0000000000000010 RDI=00000000ffc12792 RBP=00000000ffc12792 RSP=000000007faaf740
> R8 =0000000000000792 R9 =000000007faaf808 R10=00000000ffc12793 R11=00000000000003f8
> R12=0000000000000010 R13=0000000000000000 R14=000000007faaf808 R15=0000000000000012
> RIP=000000007f6e9a90 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT= 000000007f5ee698 00000047
> IDT= 000000007f186018 00000fff
> CR0=80010033 CR2=0000000000000000 CR3=000000007f801000 CR4=00000668
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000d00
> Code=22 00 00 e8 c0 e6 ff ff 48 83 c4 20 45 84 ed 74 07 fb eb 04 <44> 88 65 00 58 5b 5d 41 5c 41 5d c3 55 48 0f af 3d 1b 37 00 00 be 20 00 00 00 48 03 3d 17
>
> On the hypervisor, I see the following:
>
> [ 55.886136] get_mmio_spte: detect reserved bits on spte, addr 0xffc12792, dump hierarchy:
> [ 55.895284] ------ spte 0x1344a0827 level 4.
> [ 55.900059] ------ spte 0x134499827 level 3.
> [ 55.904877] ------ spte 0x165bf0827 level 2.
> [ 55.909651] ------ spte 0xff800ffc12817 level 1.

Ah fudge. I know what's wrong. The MMIO generation uses bit 11, which means
is_shadow_present_pte() can get a false positive on high MMIO generations. This
particular error can be squashed by explicitly checking for MMIO sptes in
get_mmio_spte(), but I'm guessing there are other flows where a false positive
is fatal (probably the __pte_list_remove bug below...). The safe thing to do is
to steal bit 11 from MMIO SPTEs so that shadow present PTEs are the only thing
that sets that bit.

I'll reproduce by stuffing the MMIO generation and get a patch posted. Sorry :-/

> When I kill the guest, I get a kernel panic:
>
> [ 95.539683] __pte_list_remove: 0000000040567a6a 0->BUG
> [ 95.545481] kernel BUG at arch/x86/kvm/mmu/mmu.c:896!