Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved

From: Dan Williams
Date: Mon Nov 11 2019 - 19:51:42 EST


On Mon, Nov 11, 2019 at 10:27 AM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Fri, Nov 08, 2019 at 06:00:46PM -0800, Dan Williams wrote:
> > On Fri, Nov 8, 2019 at 5:43 PM Sean Christopherson
> > <sean.j.christopherson@xxxxxxxxx> wrote:
> > > On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote:
> > > > Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
> > > > from the mmu_notifier. KVM holds a reference to the to-be-inserted page
> > > > until the page has been inserted, which ensures that the page is pinned and
> > > > thus won't be invalidated until after the page is inserted. This prevents
> > > > an invalidate from racing with insertion. Dropping the reference
> > > > immediately after gup() would allow the invalidate to run prior to the page
> > > > being inserted, and so KVM would map the stale PFN into the guest's page
> > > > tables after it was invalidated in the host.
> > >
> > > My previous analysis is wrong, although I did sort of come to the right
> > > conclusion.
> > >
> > > The part that's wrong is that KVM does not rely on pinning a page/pfn when
> > > installing the pfn into its secondary MMU (guest page tables). Instead,
> > > KVM keeps track of mmu_notifier invalidate requests and cancels insertion
> > > if an invalidate occured at any point between the start of hva_to_pfn(),
> > > i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must
> > > also be grabbed by mmu_notifier invalidate). So for any pfn, regardless
> > > of whether it's backed by a struct page, KVM inserts a pfn if and only if
> > > it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't
> > > already invalidated).
> > >
> > > In the page fault flow, KVM doesn't care whether or not the pfn remains
> > > valid in the associated vma. In other words, Dan's idea of immediately
> > > doing put_page() on ZONE_DEVICE pages would work for *page faults*...
> > >
> > > ...but not for all the other flows where KVM uses gfn_to_pfn(), and thus
> > > get_user_pages(). When accessing entire pages of guest memory, e.g. for
> > > nested virtualization, KVM gets the page associated with a gfn, maps it
> > > with kmap() to get a kernel address and keeps the mapping/page until it's
> > > done reading/writing the page. Immediately putting ZONE_DEVICE pages
> > > would result in use-after-free scenarios for these flows.
> >
> > Thanks for this clarification. I do want to put out though that
> > ZONE_DEVICE pages go idle, they don't get freed. As long as KVM drops
> > its usage on invalidate it's perfectly fine for KVM to operate on idle
> > ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are
> > accessed and mapped while idle. Only direct-I/O temporarily marks them
> > busy to synchronize with invalidate. KVM obviates that need by
> > coordinating with mmu-notifiers instead.
>
> Only the KVM MMU, e.g. page fault handler, coordinates via mmu_notifier,
> the kvm_vcpu_map() case would continue using pages across an invalidate.
>
> Or did I misunderstand?

Not sure *I* understand KVM's implementation.

[ Note, I already acked the solution since it fixes the problem at
hand (thanks btw!), so feel free to drop this discussion if you don't
have time to hit me with the clue bat ]

An elevated page reference count for file mapped pages causes the
filesystem (for a dax mode file) to wait for that reference count to
drop to 1 before allowing the truncate to proceed. For a page cache
backed file mapping (non-dax) the reference count is not considered in
the truncate path. It does prevent the page from getting freed in the
page cache case, but the association to the file is lost for truncate.

As long as any memory the guest expects to be persistent is backed by
mmu-notifier coordination we're all good, otherwise an elevated
reference count does not coordinate with truncate in a reliable way.