Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Yu Zhang
Date: Tue Aug 24 2021 - 06:48:34 EST


On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote:

Thanks a lot for sharing these ideas. Lots of questions are inlined below. :)

> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> game, on an acceptable direction for supporting guest private memory, e.g. for
> Intel's TDX. The TDX architectural effectively allows KVM guests to crash the
> host if guest private memory is accessible to host userspace, and thus does not

What about incorrect/malicious accesses from host kernel? Should the direct mapping
also be removed for guest private memory?

> play nice with KVM's existing approach of pulling the pfn and mapping level from
> the host page tables.
>
> This is by no means a complete patch; it's a rough sketch of the KVM changes that
> would be needed. The kernel side of things is completely omitted from the patch;
> the design concept is below.
>
> There's also fair bit of hand waving on implementation details that shouldn't
> fundamentally change the overall ABI, e.g. how the backing store will ensure
> there are no mappings when "converting" to guest private.
>
> Background
> ==========
>
> This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> memory by tracking guest memory at the 'struct page' level. This proposal is the
> result of several offline discussions that were prompted by Andy Lutomirksi's
> concerns with tracking via 'struct page':
>
> 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
> let alone a 1:1 pfn:gfn mapping.

May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just
because traversing the host PT to get a PFN(for a PageGuest(page)) is too
heavy?

>
> 2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
> gain support for exposing encrypted memory regions to guests.

Do you mean that a page not backed by 'struct page' might be mapped to other
user space? I thought the VM_GUEST flags for the VMA could prevent that(though
I may possiblely be wrong). Could you explain more? Thanks!

>
> 3. Does not help march toward page migration or swap support (though it doesn't
> hurt either).
>
> [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@xxxxxxxxxxxxxxx
>
> Concept
> =======
>
> Guest private memory must be backed by an "enlightened" file descriptor, where
> "enlightened" means the implementing subsystem supports a one-way "conversion" to
> guest private memory and provides bi-directional hooks to communicate directly
> with KVM. Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> could also be a flag provided at file creation, a property of the file system itself,
> etc...
>
> Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM and the fd's
> subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> e.g. to invalidate/move/swap a gfn range.

So the gfn->pfn translation is done by the fd's subsystem? Again, could you
please elaborate how?

And each private memory region would need a seperate group of callbacks?

>
> Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> virtual address associated with the fd and thus no userspace page tables pointing
> at the private memory.
>
> Pinning _from KVM_ is not required. If the backing store supports page migration
> and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> the operation. If the operation is not supported (this will be the case initially
> in KVM), the backing store is responsible for ensuring correct functionality.
>
> Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> from the backing store to KVM. KVM will employ techniques similar to those it uses
> for mmu_notifiers to ensure the guest cannot access freed memory.
>
> A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> existing backing stores can be englightened, a from-scratch implementations is not
> required (though would obviously be possible as well).
>
> One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> and either fail if the current size is non-zero or truncate the size to zero.

Have you discussed memfd_secret(if host direct mapping is also to be removed)?

And how does this F_SEAL_GUEST work?

>
> KVM
> ===
>
> Guest private memory is managed as a new address space, i.e. as a different set of
> memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> executing in virtual SMM. SMM is mutually exclusive with guest private memory.
>
> The fd (the actual integer) is provided to KVM when a private memslot is added
> via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned pairing occurs.

My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to
facilitate the binding of HVA and GPA ranges. But if there's no HVAs for
a private region at all, why do we need a memslot for it? Besides to keep
track of the private GFN ranges, and provide the callbacks, is there any
other reason?

Another question is: why do we need a whole new address space, instead of
one address space accommodating memslot types?

>
> By default, KVM memslot lookups will be "shared", only specific touchpoints will
> be modified to work with private memslots, e.g. guest page faults. All host
> accesses to guest memory, e.g. for emulation, will thus look for shared memory
> and naturally fail without attempting copy_to/from_user() if the guest attempts

Becasue gfn_to_hva() will fail first?

> to coerce KVM into access private memory. Note, avoiding copy_to/from_user() and
> friends isn't strictly necessary, it's more of a happy side effect.
>
> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> is added to propagate illegal accesses (see above) and implicit conversions

Sorry, illegal accesses from VM?

Do you actually mean a KVM page fault caused by private access from VM, which
implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which
then creates a private memory region and ioctls into KVM)?

If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE?
Meanwhile, is Qemu also supposed to invoke some system call into host kernel
before ioctls into KVM? I'm still confused where the kernel callbacks like
the gfn_to_pfn() come from(and how they function)... :)

> to userspace (see below). Note, the new exit reason + struct can also be to
> support several other feature requests in KVM[1][2].
>
> The guest may explicitly or implicity request KVM to map a shared/private variant
> of a GFN. An explicit map request is done via hypercall (out of scope for this
> proposal as both TDX and SNP ABIs define such a hypercall). An implicit map request
> is triggered simply by the guest accessing the shared/private variant, which KVM
> sees as a guest page fault (EPT violation or #NPF). Ideally only explicit requests
> would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.

Well, I am wondering, should we assume all guest pages as shared or private by
default? I mean, if all guest pages are private when the VM is created, maybe
the private memslots can be initialized in VM creation time, and be deleted/splited
later(e.g. in response to guest sharing hypercalls)?

It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest).

>
> For implicit or explicit mappings, if a memslot is found that fully covers the
> requested range (which is a single gfn for implicit mappings), KVM's normal guest
> page fault handling works with minimal modification.
>
> If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> the aforementioned dedicated exit reason. For implict _private_ mappings, KVM will
> also immediately exit with the same dedicated reason. For implicit shared mappings,
> an additional check is required to differentiate between emulated MMIO and an
> implicit private->shared conversion[*]. If there is an existing private memslot
> for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> emulated MMIO access and handle the page fault accordingly.
>
> [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@xxxxxxxxxx
> [2] https://lkml.kernel.org/r/20200617230052.GB27751@xxxxxxxxxxxxxxx
>
> Punching Holes
> ==============
>
> The expected userspace memory model is that mapping requests will be handled as
> conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> then map the shared gfn range. A new KVM ioctl() will likely be needed to allow
> userspace to punch a hole in a memslot, as expressing such an operation isn't
> possible with KVM_SET_USER_MEMORY_REGION. While userspace could delete the
> memslot, then recreate three new memslots, doing so would be destructive to guest
> data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> to the data for both TDX and SEV-SNP guests.

May I ask why? Thanks!

>
> Pros (vs. struct page)
> ======================
>
> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
>
> Userspace page tables are not populated, e.g. reduced memory footprint, lower
> probability of making private memory accessible to userspace.
>
> Provides line of sight to supporting page migration and swap.
>
> Provides line of sight to mapping MMIO pages into guest private memory.
>
> Cons (vs. struct page)
> ======================
>
> Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> support memslot hole punching, etc...
>
> KVM's MMU gets another method of retrieving host pfn and page size.

And the method is provided by host kernel? How does this method work?

[...]
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a272ccbddfa1..771080235b2d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> + if (memslot_is_private(slot))
> + return slot->private_ops->pfn_mapping_level(...);
> +

Oh, any suggestion how host kernel decides the mapping level here?

> host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> return min(host_level, max_level);
> }
> @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> {
> - struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> + struct kvm_memory_slot *slot;
> bool async;
>
> + slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
> +
> /*
> * Retry the page fault if the gfn hit a memslot that is being deleted
> * or moved. This ensures any existing SPTEs for the old memslot will
> @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> goto out_retry;
>
> + /*
> + * Exit to userspace to map the requested private/shared memory region
> + * if there is no memslot and (a) the access is private or (b) there is
> + * an existing private memslot. Emulated MMIO must be accessed through
> + * shared GPAs, thus a memslot miss on a private GPA is always handled
> + * as an implicit conversion "request".
> + */

For (b), do you mean this fault is for a GFN which marked as private, but now
converted to a shared? If true, could we just disallow it if no explict sharing
hypercall is triggered?

> + if (!slot &&
> + (fault->private || __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, true)))
> + goto out_convert;
> +
> if (!kvm_is_visible_memslot(slot)) {
> - /* Don't expose private memslots to L2. */
> + /* Don't expose KVM's internal memslots to L2. */
> if (is_guest_mode(vcpu)) {
> fault->pfn = KVM_PFN_NOSLOT;
> fault->map_writable = false;
> @@ -3890,6 +3906,12 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> out_retry:
> *r = RET_PF_RETRY;
> return true;
> +
> +out_convert:
> + vcpu->run->exit_reason = KVM_EXIT_MAP_MEMORY;
> + /* TODO: fill vcpu->run with more info. */
> + *r = 0;
> + return true;
> }


B.R.
Yu