Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: David Hildenbrand
Date: Wed Sep 01 2021 - 04:09:20 EST


Do we have to protect from that? How would KVM protect from user space
replacing private pages by shared pages in any of the models we discuss?

The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
as both private and shared, where "shared" also incorporates any mapping from the
host. Essentially it boils down to the kernel ensuring that a pfn is unmapped
before it's converted to/from private, and KVM ensuring that it honors any
unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
as proposed in this RFC.
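To make the direct-callback idea more concrete, here is a minimal sketch of
what such an interface between the backing store and KVM might look like; all
names (guest_mem_ops, guest_mem_consumer, ...) are made up for illustration
and are not part of the RFC:

#include <linux/list.h>
#include <linux/types.h>

/*
 * Hypothetical sketch of a direct (non-mmu_notifier) unmap channel between
 * a private-memory backing store and a consumer such as KVM.
 */
struct guest_mem_ops {
        /*
         * Called by the backing store before it frees pages or converts
         * them between private and shared; the consumer (KVM) must drop
         * all of its mappings for [start, start + npages) before returning.
         */
        void (*invalidate)(void *owner, pgoff_t start, unsigned long npages);
};

struct guest_mem_consumer {
        const struct guest_mem_ops *ops;
        void *owner;                    /* e.g. the struct kvm */
        struct list_head node;          /* linked off the backing file */
};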

Okay, so a fallocate(PUNCH_HOLE) from user space could trigger the corresponding unmapping and freeing of the backing storage.


As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
backing-store is destroyed; the backing-store needs to notify downstream MMUs
(a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
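Continuing the hypothetical sketch from above, the PUNCH_HOLE path in the
backing store would then do roughly the following (guest_mem_file and
truncate_backing_pages() are invented placeholders):

/* Hypothetical PUNCH_HOLE path in the backing store (names illustrative). */
static long guest_mem_punch_hole(struct file *file, loff_t offset, loff_t len)
{
        struct guest_mem_file *gmem = file->private_data;
        struct guest_mem_consumer *c;

        /* Tell downstream MMUs (KVM) to unmap the pfns first ... */
        list_for_each_entry(c, &gmem->consumers, node)
                c->ops->invalidate(c->owner, offset >> PAGE_SHIFT,
                                   len >> PAGE_SHIFT);

        /* ... and only then free the backing pages. */
        return truncate_backing_pages(gmem, offset, len);
}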

Right.


[*] Whether or not the kernel's direct mapping needs to be removed is debatable,
but my argument is that that behavior is not visible to userspace and thus
out of scope for this discussion, e.g. zapping/restoring the direct map can
be added/removed without impacting the userspace ABI.

Right. Removing it also shouldn't be required, IMHO. There are other ways to teach the kernel not to read/write some online pages (filter /proc/kcore, disable hibernation, strict access checks for /dev/mem ...).


Define "ordinary" user memory slots as overlay on top of "encrypted" memory
slots. Inside KVM, bail out if you encounter such a VMA inside a normal
user memory slot. When creating a "encryped" user memory slot, require that
the whole VMA is covered at creation time. You know the VMA can't change
later.
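For concreteness, the kind of check that proposal implies at memslot creation
could look roughly like the following (the helper name is made up; error
handling and corner cases omitted):

/*
 * Require that a single VMA covers the whole userspace range of an
 * "encrypted" memory slot at creation time (illustrative sketch only).
 */
static bool memslot_covered_by_one_vma(struct kvm_userspace_memory_region *mem)
{
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        bool covered;

        mmap_read_lock(mm);
        vma = find_vma(mm, mem->userspace_addr);
        covered = vma && vma->vm_start <= mem->userspace_addr &&
                  mem->userspace_addr + mem->memory_size <= vma->vm_end;
        mmap_read_unlock(mm);

        return covered;
}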

This can work for the basic use cases, but even then I'd strongly prefer not to
tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind
the virtual address of a memslot, and when it does care, it tends to do poorly,
e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and
that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers
to handle mprotect()/munmap()/etc...
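For reference, that reliance boils down to KVM's mmu_notifier hooks reacting
to invalidations; a heavily simplified sketch (the zap helper name is
invented):

static int kvm_invalidate_range_start(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range)
{
        struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

        /* Zap the secondary-MMU (EPT/NPT) mappings for the affected HVAs. */
        zap_secondary_mmu_range(kvm, range->start, range->end);
        return 0;
}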

Right, and for the existing use cases this worked. But encrypted memory
breaks many assumptions we once made ...

I have somewhat mixed feelings about pages that are mapped into $WHATEVER
page tables but not actually mapped into user space page tables. There is no
way to reach these via the rmap.

We have something like that already via vfio. And that is fundamentally
broken when it comes to mmu notifiers, page pinning, page migration, ...

I'm not super familiar with VFIO internals, but the idea with the fd-based
approach is that the backing-store would be in direct communication with KVM and
would handle those operations through that direct channel.

Right. The problem I am seeing is that e.g., try_to_unmap() might not be able to actually fully unmap a page, because some non-synchronized KVM MMU still maps a page. It would be great to evaluate how the fd callbacks would fit into the whole picture, including the current rmap.

I guess I'm missing the bigger picture of how it all fits together on the !KVM side.


As is, I don't think KVM would get any kind of notification if userspace unmaps
the VMA for a private memslot that does not have any entries in the host page
tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page
is touched by the backing store, but I don't think the end result would be any
prettier than a dedicated API for KVM to consume.

Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
to page migration or swap. For those types of operations, KVM currently just
reacts to invalidation notifications by zapping guest PTEs, and then gets the
new pfn when the guest re-faults on the page. That sequence doesn't work for
TDX or SEV-SNP because the trusted agent needs to do the memcpy() of the page
contents, i.e. the host needs to call into KVM for the actual migration.

Right, but I still think this is a kernel internal. You can do such
handshake later in the kernel IMHO.

It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
migration and that would invert the mmu_notifier API, e.g. instead of "telling"
secondary MMUs to invalidate/change mappings, the mm would be "asking"
secondary MMUs "can you move this?". More below.

In my thinking, the rmap via mmu notifiers would do the unmapping just as we know it (from primary MMU -> secondary MMU). Once try_to_unmap() succeeded, the fd provider could kick off the migration via whatever callback.
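Such a handshake could be as simple as one more callback on the hypothetical
ops structure sketched earlier, e.g. (again, invented names):

struct guest_mem_ops {
        void (*invalidate)(void *owner, pgoff_t start, unsigned long npages);
        /*
         * "Can you move this?": after the rmap side has unmapped the page,
         * the consumer (KVM) performs the actual copy of the contents,
         * e.g. via a SEAMCALL on TDX, from the old pfn to the new one.
         */
        int (*migrate)(void *owner, pgoff_t index, kvm_pfn_t old_pfn,
                       kvm_pfn_t new_pfn);
};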


But I have also been wondering: is it really KVM that has to perform the
migration, or is it the fd provider that performs it? Who says
memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
knows how to migrate such a page?

I'd love to have some details on how that's supposed to work, and which
information we'd need to migrate/swap/... in addition to the EPFN and a new
SPFN.

KVM "needs" to do the migration. On TDX, the migration will be a SEAMCALL,
a post-VMXON instruction that transfers control to the TDX-Module, that at
minimum needs a per-VM identifier, the gfn, and the page table level. The call

The per-VM identifier and the GFN would be easy to grab. Page table level, not so sure -- do you mean the general page table depth? Or whether it's mapped as 4k vs. 2M ... ? The latter could already be answered by the fd provider, I assume.

Does the page still have to be mapped into the secondary MMU when performing the migration via TDX? I assume not, which would simplify things a lot.

into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
the backing-store attempting to migrate memory that KVM is unmapping due to a
memslot change.

Something like that might be handled by fixing private memory slots, similar to my draft, right?
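Either way, I'd expect the KVM side of such a migration to boil down to
something like this (the SEAMCALL wrapper name is made up):

/*
 * Hypothetical KVM-side helper: the call into the TDX-Module needs the
 * per-VM identifier, the gfn and the mapping level, and runs under
 * mmu_lock to avoid racing with memslot changes.
 */
static int kvm_private_mem_migrate_page(struct kvm *kvm, gfn_t gfn,
                                        int level, kvm_pfn_t new_pfn)
{
        int ret;

        write_lock(&kvm->mmu_lock);
        ret = tdx_seamcall_relocate_page(kvm, gfn, level, new_pfn);
        write_unlock(&kvm->mmu_lock);

        return ret;
}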


The per-VM identifier may not apply to SEV-SNP, but I believe everything else
holds true.

Thanks!

--
Thanks,

David / dhildenb