Re: [PATCH v2] mm/gup: fix try_grab_compound_head() race with split_huge_page()

From: Jason Gunthorpe
Date: Fri Jun 18 2021 - 10:58:43 EST


On Fri, Jun 18, 2021 at 02:50:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 18, 2021 at 10:25:56AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 15, 2021 at 02:09:38PM +0200, Jann Horn wrote:
> > > On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@xxxxxxxxxx> wrote:
> > > > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > > > > if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > > > > return NULL;
> > > > > if (unlikely(!page_cache_add_speculative(head, refs)))
> > > > > return NULL;
> > > > > +
> > > > > + /*
> > > > > + * At this point we have a stable reference to the head page; but it
> > > > > + * could be that between the compound_head() lookup and the refcount
> > > > > + * increment, the compound page was split, in which case we'd end up
> > > > > + * holding a reference on a page that has nothing to do with the page
> > > > > + * we were given anymore.
> > > > > + * So now that the head page is stable, recheck that the pages still
> > > > > + * belong together.
> > > > > + */
> > > > > + if (unlikely(compound_head(page) != head)) {
> > > >
> > > > I was just wondering about what all could happen here. Such as: page gets split,
> > > > reallocated into a different-sized compound page, one that still has page pointing
> > > > to head. I think that's OK, because we don't look at or change other huge page
> > > > fields.
> > > >
> > > > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > > > how this simple check might be insufficient here. It seems fine to me, but I
> > > > routinely lack enough imagination about concurrent operations. :)
> > >
> > > Hmmm... I think the scariest aspect here is probably the interaction
> > > with concurrent allocation of a compound page on architectures with
> > > store-store reordering (like ARM). *If* the page allocator handled
> > > compound pages with lockless, non-atomic percpu freelists, I think it
> > > might be possible that the zeroing of tail_page->compound_head in
> > > put_page() could be reordered after the page has been freed,
> > > reallocated and set to refcount 1 again?
> >
> > Oh wow, yes, this all looks sketchy! Doing a RCU access to page->head
> > is a really challenging thing :\
> >
> > On the simplified store side:
> >
> > page->head = my_compound
> > *ptep = page
> >
> > There must be some kind of release barrier between those two
> > operations or this is all broken.. That definitely deserves a comment.
>
> set_compound_head() includes a WRITE_ONCE. Is that enough, or does it
> need an smp_wmb()?

Probably, at least the generic code maps smp_store_release() to
__smp_wmb.

I think Jann was making the argument that there is going to be some
other release operation due to locking between the two above, e.g. a
lock/unlock or something.
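
To make the ordering requirement concrete, here is a minimal userspace
sketch using C11 atomics rather than kernel primitives. The names
page_model, pte_model, install_page() and lookup_head() are made up for
illustration, and the acquire load stands in for the address dependency
that READ_ONCE() would give a kernel reader:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the head pointer matters. */
struct page_model {
    _Atomic(struct page_model *) head;
};

/* Hypothetical stand-in for a single page table entry. */
static _Atomic(struct page_model *) pte_model;

/* Store side: page->head = my_compound, then *ptep = page. */
static void install_page(struct page_model *page,
                         struct page_model *my_compound)
{
    atomic_store_explicit(&page->head, my_compound, memory_order_relaxed);
    /* Release: the head store above is visible before the pte is. */
    atomic_store_explicit(&pte_model, page, memory_order_release);
}

/* Load side: anyone who sees the pte also sees the head it was given. */
static struct page_model *lookup_head(void)
{
    struct page_model *page =
        atomic_load_explicit(&pte_model, memory_order_acquire);

    if (!page)
        return NULL;
    return atomic_load_explicit(&page->head, memory_order_relaxed);
}
```

If the second store were relaxed as well, a reader on a weakly ordered
machine could observe the new pte and still load a stale head.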

> > Ideally we'd use smp_store_release to install the *pte :\
> >
> > Assuming we cover the release barrier, I would think the algorithm
> > should be broadly:
> >
> > struct page *target_page = READ_ONCE(pte)
> > struct page *target_folio = READ_ONCE(target_page->head)
>
> compound_head() includes a READ_ONCE already.

Ah, see I obviously haven't memorized that detail :\

> > page_cache_add_speculative(target_folio, refs)
>
> That's spelled folio_ref_try_add_rcu() right now.

That seems a much better name

> > if (target_folio != READ_ONCE(target_page->head) ||
> > target_page != READ_ONCE(pte))
> > goto abort
> >
> > Which is what this patch does but I would like to see the
> > READ_ONCE's.
>
> ... you want them to be uninlined from compound_head(), et al?

Not really (though see below). I was mostly looking at the pte, which
just does pte_val(); there is no READ_ONCE in there.
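
Putting the quoted steps together, a rough userspace model of the
speculative grab could look like the following. This is only a sketch
under assumptions: folio_model, page_model, ref_try_add() and try_grab()
are hypothetical names, C11 atomic loads stand in for READ_ONCE(), and
ref_try_add() only imitates the "add unless zero" behaviour of
folio_ref_try_add_rcu():

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct folio_model {
    atomic_int refcount;
};

struct page_model {
    _Atomic(struct folio_model *) head;
};

/* Add refs unless the count is already zero (i.e. the folio is freed). */
static bool ref_try_add(struct folio_model *folio, int refs)
{
    int old = atomic_load(&folio->refcount);

    do {
        if (old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&folio->refcount, &old,
                                           old + refs));
    return true;
}

static struct folio_model *try_grab(_Atomic(struct page_model *) *ptep,
                                    int refs)
{
    struct page_model *target_page = atomic_load(ptep);
    struct folio_model *target_folio;

    if (!target_page)
        return NULL;
    target_folio = atomic_load(&target_page->head);
    if (!ref_try_add(target_folio, refs))
        return NULL;

    /* The folio is now pinned; recheck that nothing moved under us. */
    if (target_folio != atomic_load(&target_page->head) ||
        target_page != atomic_load(ptep)) {
        atomic_fetch_sub(&target_folio->refcount, refs);
        return NULL; /* abort: caller retries or falls back */
    }
    return target_folio;
}
```

The recheck only means something because the reference taken just
before it keeps the folio from being split or freed again.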

> > And there possibly should be two try_grab_compound_head()'s since we
> > don't need this overhead on the fully locked path, especially the
> > double atomic on page_ref_add()
>
> There's only one atomic on page_ref_add().

Look at the original patch, it adds this:

+ else
+ page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));

Here page is the folio, so it now takes two atomics to take the same
ref. This happens because we can't do hpage_pincount_available()
before having initially locked the folio, thus we can no longer
precompute what 'ref' to give to the first folio_ref_try_add_rcu()
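
As a toy illustration of the cost (not the kernel code), here are the
two ways of reaching the same refcount on the path without a separate
pincount. pin_one_step() and pin_two_step() are hypothetical helpers,
plain atomic adds stand in for the try-add, and the bias constant is a
model of GUP_PIN_COUNTING_BIAS:

```c
#include <stdatomic.h>

/* Model of the kernel's GUP_PIN_COUNTING_BIAS (1 << 10). */
#define GUP_PIN_COUNTING_BIAS_MODEL 1024

/* Precomputed amount: a single atomic RMW on the refcount. */
static int pin_one_step(atomic_int *refcount, int refs)
{
    atomic_fetch_add(refcount, refs * GUP_PIN_COUNTING_BIAS_MODEL);
    return 1; /* number of atomic operations issued */
}

/* Grab 'refs' first, then top up by refs * (BIAS - 1): two atomic RMWs. */
static int pin_two_step(atomic_int *refcount, int refs)
{
    atomic_fetch_add(refcount, refs);
    atomic_fetch_add(refcount, refs * (GUP_PIN_COUNTING_BIAS_MODEL - 1));
    return 2; /* number of atomic operations issued */
}
```

Both end at the same count; the second form just pays for an extra
atomic on the same cacheline.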

> And you need more of this overhead on the fully locked path than you
> realise; the page might be split without holding the mmap_sem, for
> example.

Fully locked here means holding the PTL spinlocks, so we know the pte
cannot change and particularly the refcount of a folio can't go to
zero. We can't change compound_head if the refcount is
elevated.

Keep in mind we also do this in gup:

folio_ref_try_add_rcu(READ_ONCE(target_page->head), 1)
[..]
folio_put_refs(READ_ONCE(target_page->head), 1)

Which makes me wonder why we have READ_ONCE inside compound_head?

I'm reading the commit message of 1d798ca3f164 ("mm: make
compound_head() robust"), and to me that looks like another special
lockless algorithm that should have the READ_ONCE in it, not the
general code.

Jason