Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: David Hildenbrand
Date: Tue Aug 31 2021 - 15:08:41 EST


On 27.08.21 04:31, Yu Zhang wrote:
On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
On 24.08.21 02:52, Sean Christopherson wrote:
The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
game, on an acceptable direction for supporting guest private memory, e.g. for
Intel's TDX. The TDX architecture effectively allows KVM guests to crash the
host if guest private memory is accessible to host userspace, and thus does not
play nice with KVM's existing approach of pulling the pfn and mapping level from
the host page tables.

This is by no means a complete patch; it's a rough sketch of the KVM changes that
would be needed. The kernel side of things is completely omitted from the patch;
the design concept is below.

There's also a fair bit of hand waving on implementation details that shouldn't
fundamentally change the overall ABI, e.g. how the backing store will ensure
there are no mappings when "converting" to guest private.


This is a lot of complexity and rather advanced approaches (not saying they
are bad, just that we try to teach the whole stack something completely
new).


What I think would really help is a list of requirements, such that
everybody is aware of what we actually want to achieve. Let me start:

GFN: Guest Frame Number
EPFN: Encrypted Physical Frame Number


1) An EPFN must not get mapped into more than one VM: it belongs to exactly
one VM. It must be shared neither between VMs in different processes nor
between VMs within a single process.


2) User space (well, and actually the kernel) must never access an EPFN:

- If we go for an fd, essentially all operations (read/write) have to
fail.
- If we have to map an EPFN into user space page tables (e.g., to
simplify KVM), we could only allow fake swap entries such that "there
is something" but it cannot be accessed and is flagged accordingly.
- /proc/kcore and friends have to be careful as well and should not read
this memory. So there has to be a way to flag these pages.

3) We need a way to express the GFN<->EPFN mapping and essentially assign an
EPFN to a GFN.


4) Once we have assigned an EPFN to a GFN, that assignment must no longer change.
Further, an EPFN must not get assigned to multiple GFNs.


5) There has to be a way to "replace" encrypted parts by "shared" parts
and the other way around.

What else?
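
To make requirements 1), 3) and 4) concrete, here is a rough userspace toy
model of the bookkeeping they imply. It is only a sketch, not kernel code:
struct model, assign_epfn(), the table sizes and the error codes are all
made-up names for illustration, and requirements 2) and 5) are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; sizes and names are illustrative assumptions. */
#define MAX_VM   4
#define MAX_GFN  16
#define MAX_EPFN 16
#define NONE     (-1)

enum assign_result {
    ASSIGN_OK = 0,
    ERR_RANGE,          /* index out of bounds */
    ERR_GFN_BOUND,      /* req. 4: GFN already has an EPFN, must not change */
    ERR_EPFN_OTHER_VM,  /* req. 1: EPFN already belongs to another VM */
    ERR_EPFN_OTHER_GFN  /* req. 4: EPFN already assigned to another GFN */
};

struct model {
    int epfn_owner[MAX_EPFN];       /* VM owning each EPFN, or NONE */
    int gfn_epfn[MAX_VM][MAX_GFN];  /* req. 3: per-VM GFN -> EPFN, or NONE */
};

void model_init(struct model *m)
{
    assert(m != NULL);
    for (int e = 0; e < MAX_EPFN; e++)
        m->epfn_owner[e] = NONE;
    for (int v = 0; v < MAX_VM; v++)
        for (int g = 0; g < MAX_GFN; g++)
            m->gfn_epfn[v][g] = NONE;
}

/* Try to assign EPFN 'epfn' to GFN 'gfn' of VM 'vm', enforcing the rules. */
enum assign_result assign_epfn(struct model *m, int vm, int gfn, int epfn)
{
    if (vm < 0 || vm >= MAX_VM || gfn < 0 || gfn >= MAX_GFN ||
        epfn < 0 || epfn >= MAX_EPFN)
        return ERR_RANGE;

    /* The GFN already has an EPFN: the assignment is fixed. */
    if (m->gfn_epfn[vm][gfn] != NONE)
        return ERR_GFN_BOUND;

    /* An owned EPFN is always assigned to some GFN of its owner. */
    if (m->epfn_owner[epfn] != NONE)
        return m->epfn_owner[epfn] == vm ? ERR_EPFN_OTHER_GFN
                                         : ERR_EPFN_OTHER_VM;

    m->epfn_owner[epfn] = vm;
    m->gfn_epfn[vm][gfn] = epfn;
    return ASSIGN_OK;
}
```

In this model, a second assign_epfn() call for an already bound GFN, or for an
EPFN that is already in use by any VM, is rejected instead of silently
replacing the existing mapping.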

Thanks a lot for this summary. A question about the requirements: do we or
do we not plan to support assigning devices to the protected VM?

Good question, I assume that is stuff for the far far future.


If yes, the fd-based solution may need to change the VFIO interface as well
(though the fake swap entry solution would need to mess with VFIO too), because:

1> KVM uses VFIO when assigning devices into a VM.

2> Since we don't know which GPA ranges the VM may use as DMA buffers, all
guest pages have to be mapped in the host IOMMU page table to host pages,
which are pinned for the whole life cycle of the VM.

3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
in vfio_dma_do_map().

4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
and pin the page.

But if we are using the fd-based solution, not every GPA can have an HVA, thus
the current VFIO interface to map and pin the GPA (IOVA) won't work. And I
doubt VFIO can be modified to support this easily.
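
For reference, the dependency on an HVA is visible in the type1 uAPI itself:
the argument of the VFIO_IOMMU_MAP_DMA ioctl carries a vaddr field next to the
iova. A minimal sketch of how user space fills it, assuming the type1 IOMMU
backend from <linux/vfio.h>; build_dma_map() is a made-up helper name, and the
sketch only builds the argument without issuing the ioctl:

```c
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * Build the argument for VFIO_IOMMU_MAP_DMA (type1 backend).
 * build_dma_map() is a hypothetical helper used only for illustration.
 */
struct vfio_iommu_type1_dma_map
build_dma_map(void *hva, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    /* Host virtual address of the backing memory: this is what gets GUP'ed. */
    map.vaddr = (uint64_t)(uintptr_t)hva;
    /* Address the device will use; for an assigned device, the GPA. */
    map.iova = iova;
    map.size = size;
    return map;
}

/*
 * A caller would then pass the result to the container fd, e.g.:
 *     ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
 * With fd-based private memory there is no HVA to put into map.vaddr.
 */
```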

I fully agree. Maybe Intel folks have some idea what that's supposed to look like in the future.

--
Thanks,

David / dhildenb