Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Sean Christopherson
Date: Tue Sep 05 2023 - 16:19:00 EST


On Mon, Sep 04, 2023, Yan Zhao wrote:
> ...
> > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > the first place. No sane guest should blindly read (or execute) uninitialized
> > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off
> > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > execution attacks these days).
> > I'm running a linux guest.
> > KSM is not turned on both in guest and host.
> > Both guest and host have turned on transparent huge page.
> >
> > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > which will cause
> > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> > mapped.
> > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> > which will fail
> >
> > The guest then writes the GFN.
> > This step will trigger (huge pmd split for huge page case) and .change_pte().
> >
> > My guest is surely a sane guest. But currently I can't find out why
> > certain pages are read before write.
> > Will return back to you the reason after figuring it out after my long vacation.
> Finally I figured out the reason.
>
> Except 4 pages were read before written from vBIOS (I just want to skip finding
> out why vBIOS does this), the remaining thousands of pages were read before
> written from the guest Linux kernel.
>
> If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> "init_on_free=1", this read before written problem goes away.
>
> However, turning on those configs has side effects as said in kernel config
> message:
> "all page allocator and slab allocator memory will be zeroed when allocated,
> eliminating many kinds of "uninitialized heap memory" flaws, especially
> heap content exposures. The performance impact varies by workload, but most
> cases see <1% impact. Some synthetic workloads have measured as high as 7%."
>
> If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> the root cause for all the reads of uninitialized heap memory are related to

Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
real solution.

> page cache pages of the guest virtual devices (specifically the virtual IDE
> device in my case).

Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has
been broadly supported for something like 15 years, even on Windows.

> The reason for this unconditional read of page into bounce buffer
> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> is explained in the code:
>
> /*
> * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
> * to the tlb buffer, if we knew for sure the device will
> * overwrite the entire current content. But we don't. Thus
> * unconditional bounce may prevent leaking swiotlb content (i.e.
> * kernel memory) to user-space.
> */
>
> If we neglect this risk and do changes like
> - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> + if (dir != DMA_FROM_DEVICE)
> + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
>
> the issue of pages read before written from guest kernel just went away.
>
> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> content, if target page content is not copied firstly to the swiotlb's
> bounce buffer, then the bounce buffer needs to be initialized to 0.
> However, swiotlb_tbl_map_single() does not know whether the target page
> is initialized or not. Then, it would cause page content to be trimmed
> if device does not overwrite the entire memory.
>
> >
> > >
> > > If there's something else going on, i.e. if your VM really is somehow generating
> > > reads before writes, and if we really want to optimize use cases that can't use
> > > hugepages for whatever reason, I would much prefer to do something like add a
> > > memslot flag to state that the memslot should *always* be mapped writable. Because
> > Will check if this flag is necessary after figuring out the reason.
> As explained above, I think it's a valid and non-rare practice in guest kernel to
> cause read of uninitialized heap memory.

Heh, for some definitions of valid.

> And the host admin may not know exactly when it's appropriate to apply the
> memslot flag.

Yeah, a memslot flag is too fine-grained.

> Do you think it's good to make the "always write_fault = true" solution enabled
> by default?

Sadly, probably not, because that would regress setups that do want to utilize
CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
negative for KSM.

I do think we should add a KVM knob though. Regardless of the validity or frequency
of the guest behavior, and even though userspace can also workaround this by
preallocating guest memory, I am struggling to think of any reason outside of KSM
where CoW semantics are desirable.

Ooh, actually, maybe we could do

static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);

and then cross our fingers that that doesn't regress some other funky setups.