Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

From: John Hubbard
Date: Mon Nov 04 2019 - 17:49:23 EST

Next message: Vladimir Oltean: "Re: [PATCH net] taprio: fix panic while hw offload sched list swap"
Previous message: Darrick J. Wong: "Re: [PATCH 02/28] xfs: Throttle commits on delayed background CIL push"
In reply to: Jerome Glisse: "Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages"
Next in thread: Jerome Glisse: "Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 11/4/19 10:52 AM, Jerome Glisse wrote:
> On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>> bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
>>
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> This also has a couple of trivial, non-functional change fixes to
>> try_get_compound_head(). That function got moved to the top of the
>> file.
>
> Maybe split that as a separate trivial patch.

Will do.

>
>>
>> This includes the following fix from Ira Weiny:
>>
>> DAX requires detection of a page crossing to a ref count of 1. Fix this
>> for GUP pages by introducing put_devmap_managed_user_page() which
>> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
>
> Please do the put_devmap_managed_page() changes in a separate
> patch, it would be a lot easier to follow, also on that front
> see comments below.

Oh! OK. It makes sense when you say it out loud. :)

...
>> +static inline bool put_devmap_managed_page(struct page *page)
>> +{
>> +	bool is_devmap = page_is_devmap_managed(page);
>> +
>> +	if (is_devmap) {
>> +		int count = page_ref_dec_return(page);
>> +
>> +		__put_devmap_managed_page(page, count);
>> +	}
>> +
>> +	return is_devmap;
>> +}
>
> I think __put_devmap_managed_page() should be renamed
> to free_devmap_managed_page() and that the count != 1
> case should move to this inline function, i.e.:
>
> static inline bool put_devmap_managed_page(struct page *page)
> {
> 	bool is_devmap = page_is_devmap_managed(page);
>
> 	if (is_devmap) {
> 		int count = page_ref_dec_return(page);
>
> 		/*
> 		 * If refcount is 1 then page is freed and refcount is stable as nobody
> 		 * holds a reference on the page.
> 		 */
> 		if (count == 1)
> 			free_devmap_managed_page(page, count);
> 		else if (!count)
> 			__put_page(page);
> 	}
>
> 	return is_devmap;
> }
>

Thanks, that does look cleaner and easier to read.
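
(Illustration only, not part of the patch: a userspace sketch of the
control flow suggested above. struct fake_page and every fake_* name are
hypothetical stand-ins, and the plain int counter ignores the atomicity
that a real page refcount needs. The two flags merely record which branch
was taken.)

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the fields needed here. */
struct fake_page {
	int refcount;
	bool is_devmap;			/* models page_is_devmap_managed() */
	bool returned_to_driver;	/* set where free_devmap_managed_page() would run */
	bool released;			/* set where __put_page() would run */
};

/* Models page_ref_dec_return(): decrement, then return the new value. */
static int fake_page_ref_dec_return(struct fake_page *page)
{
	return --page->refcount;
}

/* Returns true if the page was devmap-managed and therefore handled here. */
static bool fake_put_devmap_managed_page(struct fake_page *page)
{
	if (page->is_devmap) {
		int count = fake_page_ref_dec_return(page);

		if (count == 1)
			page->returned_to_driver = true;
		else if (!count)
			page->released = true;
	}

	return page->is_devmap;
}
```

Dropping a devmap page from 2 to 1 takes the "free" branch, dropping from
3 to 2 takes neither branch, and a non-devmap page is left untouched so
that the caller can fall back to the normal put path.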

>
>> +
>> #else /* CONFIG_DEV_PAGEMAP_OPS */
>> static inline bool put_devmap_managed_page(struct page *page)
>> {
>> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
>> return true;
>> }
>>
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
>
> What about having it as an inline here as it is pretty small.

You mean move it to a static inline function in mm.h? It's worse than it
looks, though: *everything* that it calls is also a static function, local
to gup.c. So I'd have to expose both try_get_compound_head() and
__update_proc_vmstat(). And that also means calling mod_node_page_state() from
mm.h, and it goes south right about there. :)

...
>> +/**
>> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
>> + * or pin_longterm_pages*()
>> + * @page: pointer to page to be queried.
>> + * @Return: True, if it is likely that the page has been "dma-pinned".
>> + * False, if the page is definitely not dma-pinned.
>> + */
>
> Maybe add a small comment about wrap around :)

I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().

But it's true that the documentation is a little light here... What did you have
in mind?
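
(Illustration only: a userspace model of why a biased refcount gives a
"likely pinned" answer and how an increment-side check can rule out
wrap-around. FAKE_PIN_BIAS, its value of 1024, struct fake_page, and the
exact checks below are assumptions made for this sketch; they are not
taken from the patch.)

```c
#include <limits.h>
#include <stdbool.h>

#define FAKE_PIN_BIAS 1024	/* assumed value, for illustration only */

/* Hypothetical stand-in for struct page. */
struct fake_page {
	int refcount;
};

/*
 * Models a pin: add the bias, but refuse if the page is already free
 * (refcount <= 0) or if adding the bias would overflow the counter.
 */
static bool fake_user_page_ref_inc(struct fake_page *page)
{
	if (page->refcount <= 0)
		return false;
	if (page->refcount > INT_MAX - FAKE_PIN_BIAS)
		return false;

	page->refcount += FAKE_PIN_BIAS;
	return true;
}

/*
 * "Likely pinned": true for any pinned page, but also true for a page
 * that merely has FAKE_PIN_BIAS or more ordinary references.
 */
static bool fake_page_dma_pinned(const struct fake_page *page)
{
	return page->refcount >= FAKE_PIN_BIAS;
}
```

In this model a false answer is definite, while a true answer can be a
false positive: a page holding 1024 ordinary references is reported as
pinned even though nobody pinned it.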

> [...]
>
>> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>
>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>  		if (unlikely(!pgmap)) {
>> -			undo_dev_pagemap(nr, nr_start, pages);
>> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>>  			return 0;
>>  		}
>>  		SetPageReferenced(page);
>>  		pages[*nr] = page;
>> -		get_page(page);
>> +
>> +		if (flags & FOLL_PIN) {
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>> +				return 0;
>> +			}
>
> Maybe add a comment about a case that should never happen, i.e.
> user_page_ref_inc() failing after the second iteration of the
> loop, as it would be broken and a bug to call undo_dev_pagemap()
> after the first iteration of that loop.
>
> Also, I believe that this should never happen: if the first
> iteration succeeds, then __page_cache_add_speculative() will
> succeed for all the iterations.
>
> Note that the pgmap case above follows that too, i.e. the call to
> get_dev_pagemap() can only fail on the first iteration of the loop.
> Well, I assume you can never have a huge device page that spans
> different pgmaps, i.e. different devices (which is a reasonable
> assumption). So maybe this code needs fixing, i.e.:
>
> 	pgmap = get_dev_pagemap(pfn, pgmap);
> 	if (unlikely(!pgmap))
> 		return 0;
>
>

OK, yes that does make sense. And I think a comment is adequate,
no need to check for bugs during every tail page iteration. So how
about this, as a preliminary patch:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..a4a81e125832 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
-	int nr_start = *nr;
-	struct dev_pagemap *pgmap = NULL;
+	/*
+	 * Huge pages should never cross dev_pagemap boundaries. Therefore, use
+	 * this same pgmap for the entire huge page.
+	 */
+	struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
+
+	if (unlikely(!pgmap))
+		return 0;

 	do {
 		struct page *page = pfn_to_page(pfn);

-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
-			return 0;
-		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		get_page(page);
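
(Illustration only: a userspace model of the "look up once, before the
loop" shape of the diff above. struct fake_pgmap and the fake_* functions
are hypothetical; the model assumes, as the new comment does, that a huge
page never spans two pagemaps, and it records pfns instead of taking page
references.)

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pagemap: one contiguous pfn range. */
struct fake_pgmap {
	unsigned long first_pfn;
	unsigned long last_pfn;		/* inclusive */
};

/* Models get_dev_pagemap(): find the range that contains pfn, or NULL. */
static const struct fake_pgmap *
fake_get_dev_pagemap(const struct fake_pgmap *maps, size_t nr_maps,
		     unsigned long pfn)
{
	for (size_t i = 0; i < nr_maps; i++)
		if (pfn >= maps[i].first_pfn && pfn <= maps[i].last_pfn)
			return &maps[i];

	return NULL;
}

/*
 * One lookup up front. If it fails, nothing has been stored in pfns[] yet,
 * so there is nothing to undo. Returns 1 on success, 0 on failure.
 */
static int fake_gup_device_huge(const struct fake_pgmap *maps, size_t nr_maps,
				unsigned long pfn, unsigned long nr_pages,
				unsigned long *pfns, int *nr)
{
	if (!fake_get_dev_pagemap(maps, nr_maps, pfn))
		return 0;

	for (unsigned long i = 0; i < nr_pages; i++)
		pfns[(*nr)++] = pfn + i;

	return 1;
}
```

Because the only failure point sits ahead of the loop, *nr is guaranteed
to be unchanged whenever the function returns 0.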

>> +		} else
>> +			get_page(page);
>> +
>>  		(*nr)++;
>>  		pfn++;
>>  	} while (addr += PAGE_SIZE, addr != end);
>
> [...]
>
>> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>>  	unsigned long addr, len, end;
>>  	int nr = 0, ret = 0;
>>
>> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
>> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
>
> Maybe add a comment to explain, something like:
>
> 	/*
> 	 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
> 	 *
> 	 * Note that get_user_pages_fast() imply FOLL_GET flag by default but
> 	 * callers can over-ride this default to pin case by setting FOLL_PIN.
> 	 */

Good idea. Here's the draft now:

	/*
	 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
	 *
	 * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
	 * callers can override this default by setting FOLL_PIN instead of
	 * FOLL_GET.
	 */
	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
		return -EINVAL;
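
(Illustration only: a userspace model of the mask check and of the
"FOLL_GET unless FOLL_PIN" default that the draft comment describes. The
FAKE_FOLL_* bit values are placeholders chosen for this sketch, and
fake_check_fast_gup_flags() is a hypothetical helper, not a function in
gup.c.)

```c
#include <errno.h>

/* Placeholder bit values; the real ones are defined in include/linux/mm.h. */
#define FAKE_FOLL_WRITE		0x01u
#define FAKE_FOLL_GET		0x04u
#define FAKE_FOLL_LONGTERM	0x10000u
#define FAKE_FOLL_PIN		0x40000u

#define FAKE_FAST_GUP_ALLOWED \
	(FAKE_FOLL_WRITE | FAKE_FOLL_LONGTERM | FAKE_FOLL_PIN)

/*
 * Reject any flag outside the allowed set, then apply the default:
 * callers that did not ask for FOLL_PIN get FOLL_GET behavior.
 * Returns 0 on success, -EINVAL if a disallowed flag was passed.
 */
static int fake_check_fast_gup_flags(unsigned int *gup_flags)
{
	if (*gup_flags & ~FAKE_FAST_GUP_ALLOWED)
		return -EINVAL;

	if (!(*gup_flags & FAKE_FOLL_PIN))
		*gup_flags |= FAKE_FOLL_GET;

	return 0;
}
```

Note that under this mask a caller cannot pass FOLL_GET explicitly: it is
either implied or replaced by FOLL_PIN, so the two are never set together.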

>
>>  		return -EINVAL;
>>
>>  	start = untagged_addr(start) & PAGE_MASK;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 13cc93785006..66bf4c8b88f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>
> [...]
>
>> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
>
> While I agree that user_page_ref_inc() (i.e. page_cache_add_speculative())
> should never fail here, as we are holding the pmd lock and thus no one
> can unmap the pmd and free the page it points to, I believe you should
> return -EFAULT like for the pgmap and not -ENOMEM, as the pgmap should
> not fail either for the same reason. Thus it would be better to have
> a consistent error. Maybe also add a comment explaining that it should
> not fail here.
>

OK. I'll take a pass through and fix up the remaining points about these
sorts of cases below, as well, in v3. Those all make sense.

>>
>> return page;
>> }
>
> [...]
>
>> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	 * device mapped pages can only be returned if the
>>  	 * caller will manage the page reference count.
>>  	 */
>> -	if (!(flags & FOLL_GET))
>> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
>>  		return ERR_PTR(-EEXIST);
>
> Maybe add a comment that FOLL_GET or FOLL_PIN must be set.
>
>>  	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
>> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
>
> Same as for follow_devmap_pmd() see above.
>
>>
>> return page;
>> }
>> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>>  skip_mlock:
>>  	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>>  	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
>> +
>>  	if (flags & FOLL_GET)
>>  		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = NULL;
>
> This should not fail either, as we are holding the pmd lock; maybe add
> a comment. Dunno if we want a WARN() or something to catch this
> degenerate case, or dump the page.
>
>>
>> out:
>> return page;
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index b45a95363a84..da335b1cd798 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>  same_page:
>>  		if (pages) {
>>  			pages[i] = mem_map_offset(page, pfn_offset);
>> -			get_page(pages[i]);
>> +
>> +			if (flags & FOLL_GET)
>> +				get_page(pages[i]);
>> +			else if (flags & FOLL_PIN)
>> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
>> +					spin_unlock(ptl);
>> +					remainder = 0;
>> +					err = -ENOMEM;
>> +					WARN_ON_ONCE(1);
>> +					break;
>> +				}
>>  		}
>
> user_page_ref_inc() should not fail here either because we hold the
> ptl, so the WARN_ON_ONCE() is right, but maybe add a comment.
>
>>
>> if (vmas)
>
> [...]
>
>> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>  	pte = huge_ptep_get((pte_t *)pmd);
>>  	if (pte_present(pte)) {
>>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>> +
>>  		if (flags & FOLL_GET)
>>  			get_page(page);
>> +		else if (flags & FOLL_PIN)
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				page = NULL;
>> +				goto out;
>> +			}
>
> This should not fail either (again, holding the pmd lock); dunno if we
> want a warn or something to catch this degenerate case.
>
>>  	} else {
>>  		if (is_hugetlb_entry_migration(pte)) {
>>  			spin_unlock(ptl);
>
> [...]
>
>

Those are all good points, working on them now.

thanks,
--
John Hubbard
NVIDIA
