Re: [PATCH v1] mm, hugepages: add mremap() support for hugepage backed vma

From: Mina Almasry
Date: Wed Aug 04 2021 - 14:03:40 EST


On Mon, Aug 2, 2021 at 4:51 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> On 7/30/21 3:15 PM, Mina Almasry wrote:
> > From: Ken Chen <kenchen@xxxxxxxxxx>
> >
> > Support mremap() for hugepage backed vma segment by simply repositioning
> > page table entries. The page table entries are repositioned to the new
> > virtual address on mremap().
> >
> > Hugetlb mremap() support is of course generic; my motivating use case
> > is a library (hugepage_text), which reloads the ELF text of executables
> > in hugepages. This significantly increases the execution performance of
> > said executables.
> >
> > Restricts the mremap operation on hugepages to up to the size of the
> > original mapping as the underlying hugetlb reservation is not yet
> > capable of handling remapping to a larger size.
> >
> > Tested with a simple mmap/mremap test case, roughly:
> >
> > void* haddr = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
> > MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> >
> > void* taddr = mmap(NULL, size, PROT_NONE,
> > MAP_HUGETLB | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> >
> > void* raddr = mremap(haddr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, taddr);
>
> Agree with Andrew that adding actual tests would help.
>

Fixed in upcoming v2.

> >
> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> >
> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: linux-mm@xxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Cc: Ken Chen <kenchen@xxxxxxxxxx>
> > Cc: Chris Kennelly <ckennelly@xxxxxxxxxx>
> >
> > ---
> > include/linux/hugetlb.h | 13 ++++++
> > mm/hugetlb.c | 89 +++++++++++++++++++++++++++++++++++++++++
> > mm/mremap.c | 75 ++++++++++++++++++++++++++++++++--
> > 3 files changed, 174 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f7ca1a3870ea5..685a289b58401 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -124,6 +124,7 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
> > void hugepage_put_subpool(struct hugepage_subpool *spool);
> >
> > void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > +void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
> > int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
> > int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
> > loff_t *);
> > @@ -132,6 +133,8 @@ int hugetlb_treat_movable_handler(struct ctl_table *, int, void *, size_t *,
> > int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
> > loff_t *);
> >
> > +int move_hugetlb_page_tables(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, unsigned long len);
> > int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> > long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> > struct page **, struct vm_area_struct **,
> > @@ -218,6 +221,10 @@ static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> > {
> > }
> >
> > +static inline void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
> > +{
> > +}
> > +
> > static inline unsigned long hugetlb_total_pages(void)
> > {
> > return 0;
> > @@ -265,6 +272,12 @@ static inline int copy_hugetlb_page_range(struct mm_struct *dst,
> > return 0;
> > }
> >
> > +#define move_hugetlb_page_tables(vma, old_addr, new_addr, len) \
> > + ({ \
> > + BUG(); \
> > + 0; \
> > + })
> > +
> > static inline void hugetlb_report_meminfo(struct seq_file *m)
> > {
> > }
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 528947da65c8f..bd26b00caf3cf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1004,6 +1004,23 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> > vma->vm_private_data = (void *)0;
> > }
> >
> > +/*
> > + * Reset and decrement one ref on hugepage private reservation.
> > + * Called with mm->mmap_sem writer semaphore held.
> > + * This function should be only used by move_vma() and operate on
> > + * same sized vma. It should never come here with last ref on the
> > + * reservation.
> > + */
> > +void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
> > +{
> > + struct resv_map *reservations = vma_resv_map(vma);
> > +
> > + if (reservations && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > + kref_put(&reservations->refs, resv_map_release);
> > +
> > + reset_vma_resv_huge_pages(vma);
> > +}
> > +
> > /* Returns true if the VMA has associated reserve pages */
> > static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> > {
> > @@ -4429,6 +4446,73 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > return ret;
> > }
> >
> > +static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr);
> > +
> > +static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pte_t *src_pte)
> > +{
> > + struct address_space *mapping = vma->vm_file->f_mapping;
> > + struct hstate *h = hstate_vma(vma);
> > + struct mm_struct *mm = vma->vm_mm;
> > + pte_t *dst_pte, pte;
> > + spinlock_t *src_ptl, *dst_ptl;
> > +
> > + /* Shared pagetables need more thought here if we re-enable them */
> > + BUG_ON(vma_shareable(vma, old_addr));
>
> I agree that shared page tables will complicate the code. Where do you
> actually prevent mremap on mappings which can share page tables? I
> don't see anything before this BUG.
>

Sorry, I added a check in mremap to return early if
hugetlb_vma_sharable() in v2.

> > +
> > + /* Prevent race with file truncation */
> > + i_mmap_lock_write(mapping);
>
> It may not apply as long as you really prevent remap of mappings which
> can share page tables, but i_mmap_lock_write also protects against pmd
> unsharing. In a mapping with sharing possible, src_pte is not stable
> until i_mmap_rwsem is held in write mode.
>
> > +
> > + dst_pte = huge_pte_offset(mm, new_addr, huge_page_size(h));
> > + dst_ptl = huge_pte_lock(h, mm, dst_pte);
> > + src_ptl = huge_pte_lockptr(h, mm, src_pte);
> > + /*
> > + * We don't have to worry about the ordering of src and dst ptlocks
> > + * because exclusive mmap_sem (or the i_mmap_lock) prevents deadlock.
> > + */
> > + if (src_ptl != dst_ptl)
> > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +
> > + pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
> > + set_huge_pte_at(mm, new_addr, dst_pte, pte);
> > +
> > + if (src_ptl != dst_ptl)
> > + spin_unlock(src_ptl);
> > + spin_unlock(dst_ptl);
> > + i_mmap_unlock_write(mapping);
> > +}
> > +
> > +int move_hugetlb_page_tables(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, unsigned long len)
> > +{
> > + struct hstate *h = hstate_vma(vma);
> > + unsigned long sz = huge_page_size(h);
> > + struct mm_struct *mm = vma->vm_mm;
> > + unsigned long old_end = old_addr + len;
> > + pte_t *src_pte, *dst_pte;
> > + struct mmu_notifier_range range;
> > +
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
> > + old_end);
> > + mmu_notifier_invalidate_range_start(&range);
> > + for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
> > + src_pte = huge_pte_offset(mm, old_addr, sz);
> > + if (!src_pte)
> > + continue;
> > + if (huge_pte_none(huge_ptep_get(src_pte)))
> > + continue;
> > + dst_pte = huge_pte_alloc(mm, vma, new_addr, sz);
> > + if (!dst_pte)
> > + break;
> > +
> > + move_huge_pte(vma, old_addr, new_addr, src_pte);
> > + }
> > + flush_tlb_range(vma, old_end - len, old_end);
>
> Isn't 'old_end - len' == old_addr? If so, I think old_addr is more
> clear here.
>

old_addr is getting modified in the for loop to continually point to
the old address that's being moved in the iteration. 'old_addr += sz'.

> > + mmu_notifier_invalidate_range_end(&range);
> > +
> > + return len + old_addr - old_end;
> > +}
> > +
> > void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end,
> > struct page *ref_page)
> > @@ -6043,6 +6127,11 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> > }
> >
> > #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> > +static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > + return false;
> > +}
> > +
> > pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long addr, pud_t *pud)
> > {
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index badfe17ade1f0..3c0ee2bb9c439 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -489,6 +489,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > old_end = old_addr + len;
> > flush_cache_range(vma, old_addr, old_end);
> >
> > + if (is_vm_hugetlb_page(vma))
> > + return move_hugetlb_page_tables(vma, old_addr, new_addr, len);
> > +
> > mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > old_addr, old_end);
> > mmu_notifier_invalidate_range_start(&range);
> > @@ -642,6 +645,57 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> > mremap_userfaultfd_prep(new_vma, uf);
> > }
> >
> > + if (is_vm_hugetlb_page(vma)) {
> > + /*
> > + * Clear the old hugetlb private page reservation.
> > + * It has already been transferred to new_vma.
> > + *
> > + * The reservation tracking for hugetlb private mapping is
> > + * done in two places:
> > + * 1. implicit vma size, e.g. vma->vm_end - vma->vm_start
> > + * 2. tracking of hugepages that has been faulted in already,
> > + * this is done via a linked list hanging off
> > + * vma_resv_map(vma).
> > + *
> > + * Each hugepage vma also has hugepage specific vm_ops method
> > + * and there is an imbalance in the open() and close method.
> > + *
> > + * In the open method (hugetlb_vm_op_open), a ref count is
> > + * obtained on the structure that tracks faulted in pages.
> > + *
> > + * In the close method, it unconditionally returns pending
> > + * reservation on the vma as well as release a kref count and
> > + * calls release function upon last reference.
> > + *
> > + * Because of this unbalanced operation in the open/close
> > + * method, this code runs into trouble in the mremap() path:
> > + * copy_vma will copy the pointer to the reservation structure,
> > + * then calls vma->vm_ops->open() method, which only increments
> > + * ref count on the tracking structure and does not do actual
> > + * reservation. In the same code sequence from move_vma(), the
> > + * close() method is called as a result of cleaning up original
> > + * vma segment from a call to do_munmap(). At this stage, the
> > + * tracking and reservation is out of balance, e.g. the
> > + * reservation is returned, however there is an active ref on
> > + * the tracking structure.
> > + *
> > + * When the remap'ed vma unmaps (either implicit at process
> > + * exit or explicit munmap), the reservation will be returned
> > + * again because hugetlb_vm_op_close calculate pending
> > + * reservation unconditionally based on size of vma. This
> > + * cause h->resv_huge_pages. to underflow and no more hugepages
> > + * can be allocated to application in certain situation.
> > + *
> > + * We need to reset and clear the tracking reservation, such
> > + * that we don't prematurely returns hugepage reservation at
> > + * mremap time. The reservation should only be returned at
> > + * munmap() time. This is totally undesired, however, we
> > + * don't want to re-factor hugepage reservation code at this
> > + * stage for prod kernel. Resetting is the least risky method.
> > + */
>
> We never had remapping support before, so we never had to deal with this
> situation. This new code just throws away reservations when remapping
> an anon vma area. Correct? I would like to at least think about how to
> preserve the reservations. In the move case, we know the vma will be
> the same size. So we would not even need to adjust reht reserve map,
> just preserve it for the new mapping.
>
> The explanation is helpful. However, it might make more sense to put
> it at the beginning of clear_vma_resv_huge_pages in hugetlb.c. It is a
> big comment in mremap.c that is very hugetlb specific.
>

For v2 I moved the comment locally. Let me see if I can preserve the
reservation like you described.

> > + clear_vma_resv_huge_pages(vma);
> > + }
> > +
> > /* Conceal VM_ACCOUNT so old reservation is not undone */
> > if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
> > vma->vm_flags &= ~VM_ACCOUNT;
> > @@ -736,9 +790,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> > (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> > return ERR_PTR(-EINVAL);
> >
> > - if (is_vm_hugetlb_page(vma))
> > - return ERR_PTR(-EINVAL);
> > -
> > /* We can't remap across vm area boundaries */
> > if (old_len > vma->vm_end - addr)
> > return ERR_PTR(-EFAULT);
> > @@ -949,6 +1000,24 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >
> > if (mmap_write_lock_killable(current->mm))
> > return -EINTR;
> > + vma = find_vma(mm, addr);
> > + if (!vma || vma->vm_start > addr)
> > + goto out;
>
> It looks like previously we would have returned EFAULT for this condition
> on non-hugetlb vmas?
>

Should be fixed in v2.

> I see all the special handling for vmas with userfaultfd ranges that are
> remapped. Did you look into the details to see if that still works with
> hugetlb mappings?

No, my test case currently (which I will include with v2) is quite
simple and doesn't do anything related to userfaultfd. Let me see if I
can add some userfaultfd usage to the test to detect any issues.


> --
> Mike Kravetz
>
> > +
> > + if (is_vm_hugetlb_page(vma)) {
> > + struct hstate *h __maybe_unused = hstate_vma(vma);
> > +
> > + if (old_len & ~huge_page_mask(h) ||
> > + new_len & ~huge_page_mask(h))
> > + goto out;
> > +
> > + /*
> > + * Don't allow remap expansion, because the underlying hugetlb
> > + * reservation is not yet capable to handle split reservation.
> > + */
> > + if (new_len > old_len)
> > + goto out;
> > + }
> >
> > if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
> > ret = mremap_to(addr, old_len, new_addr, new_len,
> > --
> > 2.32.0.554.ge1b32706d8-goog
> >