Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Yan Zhao
Date: Tue Sep 05 2023 - 22:21:22 EST


On Tue, Sep 05, 2023 at 01:18:23PM -0700, Sean Christopherson wrote:
> On Mon, Sep 04, 2023, Yan Zhao wrote:
> > ...
> > > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > > the first place. No sane guest should blindly read (or execute) uninitialized
> > > > memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > > been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off
> > > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > > execution attacks these days).
> > > I'm running a linux guest.
> > > KSM is not turned on both in guest and host.
> > > Both guest and host have turned on transparent huge page.
> > >
> > > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > > which will cause
> > > (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> > > mapped.
> > > (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> > > which will fail
> > >
> > > The guest then writes the GFN.
> > > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > >
> > > My guest is surely a sane guest. But currently I can't find out why
> > > certain pages are read before write.
> > > Will return back to you the reason after figuring it out after my long vacation.
> > Finally I figured out the reason.
> >
> > Except 4 pages were read before written from vBIOS (I just want to skip finding
> > out why vBIOS does this), the remaining thousands of pages were read before
> > written from the guest Linux kernel.
> >
> > If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> > "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> > "init_on_free=1", this read before written problem goes away.
> >
> > However, turning on those configs has side effects as said in kernel config
> > message:
> > "all page allocator and slab allocator memory will be zeroed when allocated,
> > eliminating many kinds of "uninitialized heap memory" flaws, especially
> > heap content exposures. The performance impact varies by workload, but most
> > cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> >
> > If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> > the root cause for all the reads of uninitialized heap memory are related to
>
> Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
> real solution.
>
> > page cache pages of the guest virtual devices (specifically the virtual IDE
> > device in my case).
>
> Why are you using IDE? IDE is comically slow compared to VirtIO, and VirtIO has
> been broadly supported for something like 15 years, even on Windows.

I don't know why I'm using IDE.
Maybe just because it's of my default settings for years :)

And I sent this series was just because I think each guest write in this case has
to cause two EPT violations is wasted.

BTW, not only for IDE devices, I think any devices with DMA mask less than max GPA
width will cause the same problem. And also true when "swiotlb=force" is enabled.

>
> > The reason for this unconditional read of page into bounce buffer
> > (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> > is explained in the code:
> >
> > /*
> > * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
> > * to the tlb buffer, if we knew for sure the device will
> > * overwrite the entire current content. But we don't. Thus
> > * unconditional bounce may prevent leaking swiotlb content (i.e.
> > * kernel memory) to user-space.
> > */
> >
> > If we neglect this risk and do changes like
> > - swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> > + if (dir != DMA_FROM_DEVICE)
> > + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> >
> > the issue of pages read before written from guest kernel just went away.
> >
> > I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> > content, if target page content is not copied firstly to the swiotlb's
> > bounce buffer, then the bounce buffer needs to be initialized to 0.
> > However, swiotlb_tbl_map_single() does not know whether the target page
> > is initialized or not. Then, it would cause page content to be trimmed
> > if device does not overwrite the entire memory.
> >
> > >
> > > >
> > > > If there's something else going on, i.e. if your VM really is somehow generating
> > > > reads before writes, and if we really want to optimize use cases that can't use
> > > > hugepages for whatever reason, I would much prefer to do something like add a
> > > > memslot flag to state that the memslot should *always* be mapped writable. Because
> > > Will check if this flag is necessary after figuring out the reason.
> > As explained above, I think it's a valid and non-rare practice in guest kernel to
> > cause read of uninitialized heap memory.
>
> Heh, for some definitions of valid.
>
> > And the host admin may not know exactly when it's appropriate to apply the
> > memslot flag.
>
> Yeah, a memslot flag is too fine-grained.
>
> > Do you think it's good to make the "always write_fault = true" solution enabled
> > by default?
>
> Sadly, probably not, because that would regress setups that do want to utilize
> CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
> negative for KSM.
>
> I do think we should add a KVM knob though. Regardless of the validity or frequency
> of the guest behavior, and even though userspace can also workaround this by
> preallocating guest memory, I am struggling to think of any reason outside of KSM
> where CoW semantics are desirable.
>
> Ooh, actually, maybe we could do
>
> static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);
>
> and then cross our fingers that that doesn't regress some other funky setups.
Oh, this "always write_fault" solution may be not friendly to UFFD write protected pages too.