[PATCH v2 00/17] KVM: x86/mmu: Page fault and MMIO cleanups

From: Paolo Bonzini
Date: Tue May 07 2024 - 11:59:39 EST


This is an updated version of the series at
https://patchew.org/linux/20240228024147.41573-1-seanjc@xxxxxxxxxx/
which has been used for SEV-SNP development, taking into account
all review comments. Patch 8 is the only completely new patch
(and even then it had been posted together with the various
TDX/SNP prep series).

Here is an annotated git range-diff:

==============================================================================
KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
- tweak commit message, add comment

@@ Commit message
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
triggers emulation of any kind, as KVM doesn't currently support emulating
access to guest private memory. Practically speaking, private faults and
- emulation are already mutually exclusive, but there are edge cases upon
- edge cases where KVM can return RET_PF_EMULATE, and adding one last check
- to harden against weird, unexpected combinations is inexpensive.
+ emulation are already mutually exclusive, but there are many flow that
+ can result in KVM returning RET_PF_EMULATE, and adding one last check
+ to harden against weird, unexpected combinations and/or KVM bugs is
+ inexpensive.

Suggested-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
@@ arch/x86/kvm/mmu/mmu_internal.h: static inline int kvm_mmu_do_page_fault(struct
else
r = vcpu->arch.mmu->page_fault(vcpu, &fault);

++ /*
++ * Not sure what's happening, but punt to userspace and hope that
++ * they can fix it by changing memory to shared, or they can
++ * provide a better error.
++ */
+ if (r == RET_PF_EMULATE && fault.is_private) {
++ pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
+ kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
+ return -EFAULT;
+ }


==============================================================================
KVM: x86: Remove separate "bit" defines for page fault error code masks
- do not use ilog2

@@ Commit message
just to see which flag corresponds to which bit is quite annoying, as is
having to define two macros just to add recognition of a new flag.

- Use ilog2() to derive the bit in permission_fault(), the one function that
- actually needs the bit number (it does clever shifting to manipulate flags
- in order to avoid conditional branches).
+ Use ternary operator to derive the bit in permission_fault(), the one
+ function that actually needs the bit number as part of clever shifting
+ to avoid conditional branches. Generally the compiler is able to turn
+ it into a conditional move, and if not it's not really a big deal.

No functional change intended.

@@ arch/x86/kvm/mmu.h: static inline u8 permission_fault(struct kvm_vcpu *vcpu, str
u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
- int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
-+ int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
++ int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
u32 errcode = PFERR_PRESENT_MASK;
bool fault;

@@ arch/x86/kvm/mmu.h: static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;

/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
- offset = (pfec & ~1) +
+- offset = (pfec & ~1) +
- ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
-+ ((pte_access & PT_USER_MASK) << (ilog2(PFERR_RSVD_MASK) - PT_USER_SHIFT));
++ offset = (pfec & ~1) | ((pte_access & PT_USER_MASK) ? PFERR_RSVD_MASK : 0);

pkru_bits &= mmu->pkru_mask >> offset;
errcode |= -pkru_bits & PFERR_PK_MASK;

==============================================================================
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
- match commit message to bits defined in header file

@@ Commit message
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
guests, as specified by the APM:

+ * Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a
+ VMPL check failure, 0 otherwise.
* Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
* Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
* Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission
check failure, 0 otherwise.
- * Bit 37 (SSS): Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
- enabled.

Note, the APM is *extremely* misleading, and strongly implies that the
above flags can _only_ be set for #NPF exits from SNP guests. That is a


==============================================================================
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
- add a description of PFERR_IMPLICIT_ACCESS

@@ Commit message
Add a compile-time assert in the legacy #PF handler to make sure that KVM-
define flags are covered by its existing sanity check on the upper bits.

+ Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
+ are removing the comment that defined it.
+
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
+ Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx>
+ Reviewed-by: Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx>
Message-ID: <20240228024147.41573-8-seanjc@xxxxxxxxxx>
+ Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
+
+ ## arch/x86/include/asm/kvm_host.h ##
+@@ arch/x86/include/asm/kvm_host.h: enum x86_intercept_stage;
+ #define PFERR_GUEST_ENC_MASK BIT_ULL(34)
+ #define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
+ #define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
++
++/*
++ * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP checks
++ * when emulating instructions that triggers implicit access.
++ */
+ #define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
++#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS)
+
+ #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
+ PFERR_WRITE_MASK | \

## arch/x86/kvm/mmu/mmu.c ##
@@ arch/x86/kvm/mmu/mmu.c: int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
- if (WARN_ON_ONCE(error_code >> 32))
- error_code = lower_32_bits(error_code);
+ return -EFAULT;
+ #endif

+ /* Ensure the above sanity check also covers KVM-defined flags. */
+ BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));

- move in front of the other synthetic page fault error code patches

==============================================================================
KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
commit message copy editing

@@ Commit message
and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
32-bit field.

- Simply drop the upper bits of hardware provides garbage, as spurious
+ Simply drop the upper bits if hardware provides garbage, as spurious
information should do no harm (though in all likelihood hardware is buggy
and the kernel is doomed).

@@ Commit message
which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
PFERR_GUEST_ENC_MASK without running afoul of the sanity check.

- Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
+ Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
bit minimizes the probability of a collision with the "other" vendor,
without needing to plumb more bits through microcode.


==============================================================================
KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
- patch reordering, no other changes

==============================================================================
KVM: x86/mmu: check for invalid async page faults involving private memory
- new patch coming from TDX/SNP prep series; test PFERR_PRIVATE_ACCESS, set arch.error_code

@@ arch/x86/kvm/mmu/mmu.c: static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
arch.token = alloc_apf_token(vcpu);
- arch.gfn = gfn;
+ arch.gfn = fault->gfn;
++ arch.error_code = fault->error_code;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);

@@ arch/x86/kvm/mmu/mmu.c: static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
{
int r;

-+ if (WARN_ON_ONCE(work->arch.error_code & PFERR_GUEST_ENC_MASK))
++ if (WARN_ON_ONCE(work->arch.error_code & PFERR_PRIVATE_ACCESS))
+ return;
+
if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||


==============================================================================
KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
- exit to userspace if the wrong case happens, test PFERR_PRIVATE_ACCESS

@@ Commit message

## arch/x86/kvm/mmu/mmu.c ##
@@ arch/x86/kvm/mmu/mmu.c: int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
- error_code |= PFERR_PRIVATE_ACCESS;

r = RET_PF_INVALID;
-- if (unlikely(error_code & PFERR_RSVD_MASK)) {
-+ if (unlikely((error_code & PFERR_RSVD_MASK) &&
-+ !WARN_ON_ONCE(error_code & PFERR_GUEST_ENC_MASK))) {
+ if (unlikely(error_code & PFERR_RSVD_MASK)) {
++ if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
++ return -EFAULT;
++
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
if (r == RET_PF_EMULATE)
goto emulate;

==============================================================================
KVM: x86/mmu: Move private vs. shared check above slot validity checks
- add comment about use of mmu_invalidate_seq

@@ arch/x86/kvm/mmu/mmu.c: static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, stru
return kvm_faultin_pfn_private(vcpu, fault);

@@ arch/x86/kvm/mmu/mmu.c: static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ {
+ int ret;
+
++ /*
++ * Note that the mmu_invalidate_seq also serves to detect a concurrent
++ * change in attributes. is_page_fault_stale() will detect an
++ * invalidation relate to fault->fn and resume the guest without
++ * installing a mapping in the page tables.
++ */
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();

+ /*
-+ * Check for a private vs. shared mismatch *after* taking a snapshot of
-+ * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
-+ * invalidation notifier.
++ * Now that we have a snapshot of mmu_invalidate_seq we can check for a
++ * private vs. shared mismatch.
+ */
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);



==============================================================================
KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()
- differences in moved code, range-diff is unreadable


==============================================================================
KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()
- remove unnecessary change

@@ arch/x86/kvm/mmu/mmu.c: static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct
return kvm_handle_noslot_fault(vcpu, fault, access);

/*
-
- ## arch/x86/kvm/mmu/mmu_internal.h ##
-@@ arch/x86/kvm/mmu/mmu_internal.h: struct kvm_page_fault {
- /* The memslot containing gfn. May be NULL. */
- struct kvm_memory_slot *slot;
-
-- /* Outputs of kvm_faultin_pfn. */
-+ /* Outputs of kvm_faultin_pfn. */
- unsigned long mmu_seq;
- kvm_pfn_t pfn;
- hva_t hva;


Isaku Yamahata (1):
KVM: x86/mmu: Pass full 64-bit error code when handling page faults

Paolo Bonzini (1):
KVM: x86/mmu: check for invalid async page faults involving private
memory

Sean Christopherson (15):
KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits
emulation
KVM: x86: Remove separate "bit" defines for page fault error code
masks
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are
non-zero
KVM: x86/mmu: Use synthetic page fault error code to indicate private
faults
KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page
faults
KVM: x86/mmu: Move private vs. shared check above slot validity checks
KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC
internal slots
KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO
KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to
kvm_faultin_pfn()
KVM: x86/mmu: Handle no-slot faults at the beginning of
kvm_faultin_pfn()
KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot"
faults
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create
noslot pfns

arch/x86/include/asm/kvm_host.h | 46 ++++----
arch/x86/kvm/mmu.h | 5 +-
arch/x86/kvm/mmu/mmu.c | 182 ++++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 28 ++++-
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/svm/svm.c | 9 ++
6 files changed, 174 insertions(+), 98 deletions(-)

--
2.43.0