Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved

From: Sean Christopherson
Date: Tue Nov 12 2019 - 11:57:20 EST


On Tue, Nov 12, 2019 at 11:19:44AM +0100, Paolo Bonzini wrote:
> On 12/11/19 01:51, Dan Williams wrote:
> > An elevated page reference count for file mapped pages causes the
> > filesystem (for a dax mode file) to wait for that reference count to
> > drop to 1 before allowing the truncate to proceed. For a page cache
> > backed file mapping (non-dax) the reference count is not considered in
> > the truncate path. It does prevent the page from getting freed in the
> > page cache case, but the association to the file is lost for truncate.
>
> KVM support for file-backed guest memory is limited. It is not
> completely broken, in fact cases such as hugetlbfs are in use routinely,
> but corner cases such as truncate aren't covered well indeed.

KVM's actual MMU should be ok since it coordinates with the mmu_notifier.

kvm_vcpu_map() is where KVM could run afoul of page cache truncation.
This is the other main use of hva_to_pfn*(), where KVM directly accesses
guest memory (which could be file-backed) without coordinating with the
mmu_notifier. IIUC, an ill-timed page cache truncation could result in a
write from KVM effectively being dropped due to writeback racing with
KVM's write to the page. If that's true, then I think KVM would need to
move to the proposed pin_user_pages() to ensure its "DMA" isn't lost.
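To illustrate, a rough sketch (not a real patch) of what the kvm_vcpu_map()
path might look like with FOLL_PIN semantics instead of a plain reference.
Function names follow the proposed pin_user_pages() series and may differ
from what actually lands:

```c
/*
 * Sketch only: pin a guest page for "DMA"-like access instead of
 * taking a bare page reference via hva_to_pfn*()/get_user_pages().
 * hva here stands in for the page-aligned userspace address of the
 * gfn being mapped.
 */
struct page *page;

/*
 * FOLL_PIN-style pinning tells the mm that the page is under direct
 * access, so writeback/truncate can coordinate with the pin rather
 * than racing with KVM's write.
 */
if (pin_user_pages_fast(hva, 1, FOLL_WRITE, &page) != 1)
	return -EFAULT;

/* ... KVM writes guest memory through the kernel mapping ... */

/*
 * Mark the page dirty and drop the pin together so the dirty state
 * can't be lost to a writeback that raced with the write above.
 */
unpin_user_pages_dirty_lock(&page, 1, true);
```

The point of the pin/unpin pairing is that the filesystem (dax in
particular) can see the elevated pin count and hold off truncation,
which a plain get_page() reference does not reliably do.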

> > As long as any memory the guest expects to be persistent is backed by
> > mmu-notifier coordination we're all good, otherwise an elevated
> > reference count does not coordinate with truncate in a reliable way.

KVM itself is (mostly) blissfully unaware of any such expectations. The
userspace VMM, e.g. Qemu, is ultimately responsible for ensuring the guest
sees a valid model, e.g. that persistent memory (as presented to the guest)
is actually persistent (from the guest's perspective).

The big caveat is the truncation issue above.