Re: [RFC 00/16] KVM protected memory extension

From: Kirill A. Shutemov
Date: Mon May 25 2020 - 01:27:13 EST


On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC people who worked on the related patchsets.

> == What does this set mitigate? ==
>
> - Host kernel "accidental" access to guest data (think speculation)
>
> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
> - Host userspace access to guest data (compromised qemu)
>
> == What does this set NOT mitigate? ==
>
> - Full host kernel compromise. Kernel will just map the pages again.
>
> - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for merging.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
> - This protects from some kernel and host userspace read-only attacks,
> but does not place the host kernel outside the trust boundary. Is it
> still valuable?
>
> - Can this approach be used to avoid cache-coherency problems with
> hardware encryption schemes that repurpose physical bits?
>
> - The guest kernel must be modified for this to work. Is that a deal
> breaker, especially for public clouds?
>
> - Are the costs of removing pages from the direct map too high to be
> feasible?
>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless. But this teaches us something very useful:
> neither the kernel nor the userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both the host and the guest kernel. The guest OS must
> enable the feature via a hypercall and mark any memory range that has to be
> shared with the host: DMA regions, bounce buffers, etc. SEV does this marking
> via a bit in the guest's page table, while this approach uses a hypercall.
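>
> As a concrete (purely illustrative) example, the guest-side call to share a
> range could look roughly like the sketch below. The hypercall number and the
> argument convention here are placeholders, not the actual ABI:
>
>	/*
>	 * Illustrative sketch of the guest side: tell the host that a
>	 * range of guest memory may be mapped again. The hypercall
>	 * number is a placeholder.
>	 */
>	#include <linux/mm.h>
>	#include <asm/kvm_para.h>
>
>	#define KVM_HC_MEM_SHARE_SKETCH	100	/* placeholder */
>
>	static long guest_mem_share(unsigned long vaddr, unsigned long npages)
>	{
>		/* Pass the PFN of the first page and the number of pages. */
>		return kvm_hypercall2(KVM_HC_MEM_SHARE_SKETCH,
>				      PFN_DOWN(__pa(vaddr)), npages);
>	}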
>
> To remove the userspace mapping, we use a trick similar to what NUMA
> balancing does: memory that belongs to KVM memory slots is converted to
> PROT_NONE: all existing entries are converted to PROT_NONE with mprotect()
> and newly faulted-in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
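>
> In pseudo-kernel-C, the access check boils down to something like this
> (a simplified sketch, not the exact hunk from the patch; the real hook
> sits in the GUP/fault code):
>
>	/*
>	 * Sketch: pages in a VM_KVM_PROTECTED VMA are handed out only
>	 * to callers that pass FOLL_KVM. Everybody else gets -EFAULT
>	 * from GUP, and a plain userspace dereference faults on the
>	 * PROT_NONE PTE and receives SIGBUS.
>	 */
>	static struct page *check_kvm_protected(struct vm_area_struct *vma,
>						struct page *page,
>						unsigned int gup_flags)
>	{
>		if (!(vma->vm_flags & VM_KVM_PROTECTED))
>			return page;			/* normal VMA */
>		if (gup_flags & FOLL_KVM)
>			return page;			/* KVM access, allowed */
>		return ERR_PTR(-EFAULT);		/* everything else */
>	}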
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes the local TLB. I think this is a reasonable compromise between
> security and performance.
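>
> In the anonymous fault path this amounts to roughly the following (a
> sketch; kernel_map_pages() is the existing DEBUG_PAGEALLOC helper that
> is reused here):
>
>	/*
>	 * Sketch: once a fresh anonymous page has been faulted into a
>	 * VM_KVM_PROTECTED VMA, drop it from the direct mapping.
>	 * kernel_map_pages(page, n, 0) unmaps, (..., 1) maps it back;
>	 * only the local TLB gets flushed.
>	 */
>	if (vma->vm_flags & VM_KVM_PROTECTED)
>		kernel_map_pages(page, 1, 0);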
>
> Zapping the PTE brings the page back to the direct mapping after it has been
> cleared. At least for now, we don't remove file-backed pages from the direct
> mapping: such pages can be accessed via the read/write syscalls, and handling
> that adds complexity.
>
> Occasionally, the host kernel has to access guest memory that the guest has
> not made shared. For instance, this happens during instruction emulation.
> Normally, such access is done via copy_to/from_user(), which would now fail
> with -EFAULT. We introduce a new pair of helpers: copy_to/from_guest(). The
> new helpers acquire the page via GUP, map it into the kernel address space
> with a kmap_atomic()-style mechanism and only then copy the data.
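>
> Conceptually, the read side does something like the sketch below (an
> illustration of the idea only, not the helper from the series; error
> handling and page-crossing copies are omitted):
>
>	/*
>	 * Sketch of copy_from_guest(): pin the page with FOLL_KVM
>	 * (plain copy_from_user() would trip over the PROT_NONE
>	 * mapping), map it temporarily and copy from there.
>	 */
>	static int copy_from_guest_sketch(void *to, unsigned long from,
>					  unsigned long n)
>	{
>		unsigned long offset = from & ~PAGE_MASK;
>		struct page *page;
>		void *vaddr;
>
>		if (get_user_pages_unlocked(from, 1, &page, FOLL_KVM) != 1)
>			return -EFAULT;
>
>		vaddr = kmap_atomic(page);
>		memcpy(to, vaddr + offset, n);
>		kunmap_atomic(vaddr);
>
>		put_page(page);
>		return 0;
>	}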
>
> For some instruction emulation, copying is not good enough: cmpxchg
> emulation needs direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate this case.
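>
> For illustration, the cmpxchg emulation path conceptually does something
> like the following (sketch only; it uses the existing kvm_vcpu_map()
> wrapper that sits on top of __kvm_map_gfn(), and the values are
> placeholders):
>
>	/*
>	 * Sketch, inside the emulator callback: the compare-and-exchange
>	 * must hit the real guest page, so the page is mapped into the
>	 * host kernel instead of being copied.
>	 */
>	struct kvm_host_map map;
>	u32 *ptr, old = 0, new = 1;	/* placeholder values */
>
>	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
>		return X86EMUL_UNHANDLEABLE;
>
>	ptr = map.hva + offset_in_page(gpa);
>	cmpxchg(ptr, old, new);		/* operates on the guest page directly */
>
>	kvm_vcpu_unmap(vcpu, &map, true);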
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@xxxxxxxxx
>
> == Open Issues ==
>
> Unmapping the pages from the direct mapping brings a few issues that have
> not been rectified yet:
>
> - Touching the direct mapping leads to fragmentation. We need to be able to
> recover from it. I have a buggy patch that aims at recovering 2M/1G pages.
> It has to be fixed and tested properly.
>
> - Page migration and KSM are not supported yet.
>
> - Live migration of a guest would require a new flow. Not sure yet what it
> would look like.
>
> - The feature interferes with NUMA balancing. Not sure yet if it's
> possible to make them work together.
>
> - Guests have no mechanism to ensure that even a well-behaving host has
> unmapped its private data. With SEV, for instance, the guest only has
> to trust the hardware to encrypt a page after the C bit is set in a
> guest PTE. A mechanism for a guest to query the host mapping state, or
> to constantly assert the intent for a page to be Private, would be
> valuable.
--
Kirill A. Shutemov