Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

From: David Stevens
Date: Tue Sep 05 2023 - 12:59:20 EST


On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Jul 11, 2023, Zhi Wang wrote:
> > On Thu, 6 Jul 2023 15:49:39 +0900
> > David Stevens <stevensd@xxxxxxxxxxxx> wrote:
> >
> > > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <zhi.wang.linux@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > > David Stevens <stevensd@xxxxxxxxxxxx> wrote:
> > > > If yes, do we have to use FOLL_GET to resolve the GFN associated with a tail
> > > > page? It seems gup can tolerate gup_flags without FOLL_GET, but that is more
> > > > of a temporary accommodation. I don't think it is a good idea to play tricks
> > > > with a temporary accommodation; it feels like we are abusing the tolerance.
> > >
> > > I'm not sure I understand what you're getting at. This series never
> > > calls gup without FOLL_GET.
> > >
> > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > notifier, it makes sense to support returning a pfn without taking a
> > > reference. And we indeed need to do that for certain types of memory.
> > >
> >
> > I don't have a problem with taking a pfn without taking a ref. I am
> > questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> > taking a pfn without a ref is a good idea, when there is another flag that
> > actually conveys it.
> >
> > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> > requirements of GUP in the code path that is going to call GUP is
> > reasonable.
> >
> > But using FOLL_XXX for purposes that are not related to the GUP call really
> > feels off.
>
> I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> that handles non-refcounted pages, i.e. this
>
>         if (get_page_unless_zero(page)) {
>                 foll->is_refcounted_page = true;
>                 if (!(foll->flags & FOLL_GET))
>                         put_page(page);
>         } else if (foll->flags & FOLL_GET) {
>                 r = -EFAULT;
>         }
>
> should be
>
>         if (get_page_unless_zero(page)) {
>                 foll->is_refcounted_page = true;
>                 if (!(foll->flags & FOLL_GET))
>                         put_page(page);
>         } else if (!foll->guarded_by_mmu_notifier) {
>                 r = -EFAULT;
>         }
>
> because it's not the desire to grab a reference that makes getting non-refcounted
> pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
>
> Though that highlights that checking guarded_by_mmu_notifier should be done for
> *all* non-refcounted pfns, not just non-refcounted struct page memory.

I think things are getting confused here because there are multiple
things which "safe" refers to. There are three different definitions
that I think are relevant here:

1) "safe" in the sense that KVM doesn't corrupt page reference counts
2) "safe" in the sense that KVM doesn't access pfns after they have been freed
3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations

For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
then they expect to be able to pass the returned pfn to
kvm_release_pfn. This means that when FOLL_GET is set, if
kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
must take a reference to avoid eventually corrupting the page's
refcount. I guess replacing the FOLL_GET check with
!guarded_by_mmu_notifier is logically equivalent, because
__kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
and FOLL_GET is set. But since we're concerned about a property of the
refcount, I think checking FOLL_GET is clearer.
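
To make property 1 concrete, a FOLL_GET caller's lifecycle looks
roughly like this (a sketch using this series' names; the surrounding
caller and the .slot/.gfn initializers are made up for illustration):

        struct kvm_follow_pfn foll = {
                .slot = slot,
                .gfn = gfn,
                .flags = FOLL_GET | FOLL_WRITE,
        };
        kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

        if (is_error_noslot_pfn(pfn))
                return -EFAULT;
        /* ... access the memory backing the pfn ... */
        kvm_release_pfn_clean(pfn);     /* balances the ref FOLL_GET took; if
                                         * no ref was actually taken, this is
                                         * where the refcount gets corrupted */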

For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
is set, then we're all good here. If guarded_by_mmu_notifier is not
set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
set. For struct page memory, we're safe because KVM will hold a
reference as long as it's still using the page. For non-struct-page
memory, we're not safe - this is where the breaking change of
allow_unsafe_mappings would go. Note that for non-refcounted struct
pages, we can't use the allow_unsafe_mappings escape hatch: since
FOLL_GET was requested, if we returned such a page, the caller
would eventually corrupt the page refcount via kvm_release_pfn.
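
Spelled out as code, the property 2 case analysis amounts to something
like the following (the helper and its bool parameters are hypothetical;
guarded_by_mmu_notifier and allow_unsafe_mappings are the names used in
this thread):

static int check_pfn_liveness(struct kvm_follow_pfn *foll,
                              bool has_struct_page, bool got_ref)
{
        /* MMU notifier users are told when the translation goes away. */
        if (foll->guarded_by_mmu_notifier)
                return 0;
        /* Otherwise FOLL_GET is guaranteed; a taken ref pins the page. */
        if (got_ref)
                return 0;
        /* Non-refcounted struct page: the ref the caller asked for can't
         * be taken, and releasing it later would free the page. */
        if (has_struct_page)
                return -EFAULT;
        /* Non-struct-page memory: unsafe, unless userspace explicitly
         * opts in via allow_unsafe_mappings. */
        return allow_unsafe_mappings ? 0 : -EFAULT;
}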

Property 3 would be nice, but we've already concluded that guarding
all translations with mmu notifiers is infeasible. So maintaining
property 2 is the best we can hope for.
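
For reference, the precondition this argument leans on, i.e. the check
in __kvm_follow_pfn that at least one protection is in place, is
essentially (sketched from the description above, not copied from the
patch):

        if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) &&
                         !foll->guarded_by_mmu_notifier))
                return KVM_PFN_ERR_FAULT;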

> As for the other usage of FOLL_GET in this series (using it to conditionally do
> put_page()), IMO that's very much related to the GUP call. Invoking put_page()
> is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
> without grabbing a reference to the page. In an ideal world, KVM would NOT pass
> FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
>
> I do think it's worth providing a helper to consolidate and document that hacky
> code, e.g. add a kvm_follow_refcounted_pfn() helper.
>
> All in all, I think the below (completely untested) is what we want?
>
> David (and others), I am planning on doing a full review of this series "soon",
> but it will likely be a few weeks until that happens. I jumped in on this
> specific thread because this caught my eye and I really don't want to throw out
> *all* of the FOLL_GET usage.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5b5afd70f239..90d424990e0a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>          return rc == -EHWPOISON;
>  }
>
> +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> +                                           struct page *page)
> +{
> +        kvm_pfn_t pfn = page_to_pfn(page);
> +
> +        foll->is_refcounted_page = true;
> +
> +        /*
> +         * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> +         * doesn't want to grab a reference, but gup() doesn't support getting
> +         * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> +         * changes, drop this and simply don't pass FOLL_GET to gup().
> +         */
> +        if (!(foll->flags & FOLL_GET))
> +                put_page(page);
> +
> +        return pfn;
> +}
> +
>  /*
>   * The fast path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned. It's also the
> @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                  return false;
>
>          if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> -                *pfn = page_to_pfn(page[0]);
>                  foll->writable = foll->allow_write_mapping;
> -                foll->is_refcounted_page = true;
> -                if (!(foll->flags & FOLL_GET))
> -                        put_page(page[0]);
> +
> +                *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
>                  return true;
>          }
>
> @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                  return npages;
>
>          foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> -        foll->is_refcounted_page = true;
>
>          /* map read fault as writable if possible */
>          if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                          page = wpage;
>                  }
>          }
> -        *pfn = page_to_pfn(page);
> -        if (!(foll->flags & FOLL_GET))
> -                put_page(page);
> +
> +        *pfn = kvm_follow_refcounted_pfn(foll, page);
>          return npages;
>  }
>
> @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
>          if (!page)
>                  goto out;
>
> -        if (get_page_unless_zero(page)) {
> -                foll->is_refcounted_page = true;
> -                if (!(foll->flags & FOLL_GET))
> -                        put_page(page);
> -        } else if (foll->flags & FOLL_GET) {
> -                r = -EFAULT;
> -        }
> -
> +        if (get_page_unless_zero(page))
> +                WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
>  out:
>          pte_unmap_unlock(ptep, ptl);
> -        *p_pfn = pfn;
> +
> +        if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> +            !allow_unsafe_mappings)
> +                r = -EFAULT;
> +        else
> +                *p_pfn = pfn;
>
>          return r;
>  }
>

As I pointed out above, this suggestion is broken because a FOLL_GET
&& !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
non-refcounted page will result in the refcount eventually being
corrupted.
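
Concretely, with allow_unsafe_mappings enabled, the failure would look
something like this (an illustrative sequence, not code from the series;
kvm_vcpu_map is a real FOLL_GET user with no MMU notifier protection):

        struct kvm_follow_pfn foll = {
                .flags = FOLL_GET,
                .guarded_by_mmu_notifier = false,
        };
        kvm_pfn_t pfn = __kvm_follow_pfn(&foll);
        /*
         * For a non-refcounted page, get_page_unless_zero() fails, yet the
         * proposed allow_unsafe_mappings escape hatch still returns the pfn
         * without a reference having been taken. The FOLL_GET caller can't
         * know that, so on teardown it does:
         */
        kvm_release_pfn_dirty(pfn);     /* put_page() on a ref KVM never took */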

What do you think of this implementation? If it makes sense, I can
send out an updated patch series.

        /*
         * If FOLL_GET is set, then the caller wants us to take a reference to
         * keep the pfn alive. If FOLL_GET isn't set, then __kvm_follow_pfn
         * guarantees that guarded_by_mmu_notifier is set, so there aren't any
         * use-after-free concerns.
         */
        page = kvm_pfn_to_refcounted_page(pfn);
        if (page) {
                if (get_page_unless_zero(page)) {
                        WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
                } else if (foll->flags & FOLL_GET) {
                        /*
                         * Certain IO or PFNMAP mappings can be backed with
                         * valid struct pages but be allocated without
                         * refcounting, e.g. tail pages of non-compound
                         * higher-order allocations. The caller asked for a
                         * ref, but we can't take one, since releasing such a
                         * ref would free the page.
                         */
                        r = -EFAULT;
                }
        } else if (foll->flags & FOLL_GET) {
                /*
                 * When there's no struct page to refcount and no MMU notifier,
                 * KVM can't guarantee that it will avoid use-after-free.
                 * However, there are valid reasons to set up such mappings. If
                 * userspace is trusted and willing to forego kernel safety
                 * guarantees, allow this check to be bypassed.
                 */
                if (!foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)
                        r = -EFAULT;
        }

-David