Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

From: Mike Kravetz
Date: Mon Sep 12 2022 - 19:03:02 EST


On 09/05/22 11:08, Miaohe Lin wrote:
> On 2022/9/3 7:07, Mike Kravetz wrote:
> > On 08/30/22 10:02, Miaohe Lin wrote:
> >> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>> The new hugetlb vma lock (rw semaphore) is used to address this race:
> >>>
> >>> Faulting thread                                 Unsharing thread
> >>> ...                                             ...
> >>> ptep = huge_pte_offset()
> >>>       or
> >>> ptep = huge_pte_alloc()
> >>> ...
> >>>                                                 i_mmap_lock_write
> >>>                                                 lock page table
> >>> ptep invalid   <------------------------       huge_pmd_unshare()
> >>> Could be in a previously                       unlock_page_table
> >>> sharing process or worse                       i_mmap_unlock_write
> >>> ...
> >>>
> >>> The vma_lock is used as follows:
> >>> - During fault processing, the lock is acquired in read mode before
> >>> doing a page table lock and allocation (huge_pte_alloc). The lock is
> >>> held until code is finished with the page table entry (ptep); see the
> >>> sketch after this list.
> >>> - The lock must be held in write mode whenever huge_pmd_unshare is
> >>> called.
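To illustrate the first rule above: the fault path usage is roughly the
sketch below. The lock/unlock calls are the ones this series adds; the
surrounding fault code (mm, vma, haddr, h from the usual fault context) is
abbreviated and partly hypothetical.

        pte_t *ptep;

        /*
         * Take the vma lock in read mode before page table lookup/
         * allocation, and hold it for as long as the ptep is used.
         * huge_pmd_unshare() requires the lock in write mode, so it
         * cannot invalidate the ptep while we hold it.
         */
        hugetlb_vma_lock_read(vma);
        ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
        if (!ptep) {
                hugetlb_vma_unlock_read(vma);
                return VM_FAULT_OOM;
        }

        /* ... fault handling that dereferences ptep ... */

        hugetlb_vma_unlock_read(vma);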
> >>>
> >>> Lock ordering issues come into play when unmapping a page from all
> >>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
> >>> vmas, and the vma lock must be held before calling unmap, which will
> >>> call huge_pmd_unshare. This is done today in:
> >>> - try_to_migrate_one and try_to_unmap_one for page migration and memory
> >>> error handling. In these routines we 'try' to obtain the vma lock and
> >>> fail to unmap if unsuccessful. Calling routines already deal with the
> >>> failure of unmapping; see the sketch after this list.
> >>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
> >>> also tries to acquire the vma lock. If it fails, it skips the
> >>> unmapping. However, we cannot have file truncation or hole punch
> >>> fail because of contention. After hugetlb_vmdelete_list, truncation
> >>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
> >>> checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
> >>> hugetlb_unmap_file_folio is designed to drop locks and reacquire them in
> >>> the correct order to guarantee unmap success.
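Roughly, the 'try' pattern in the rmap walkers looks like the fragment
below. This is simplified from the try_to_unmap_one() changes, with
unrelated details omitted (pvmw and ret come from the enclosing walk loop).

        /* in the rmap walk, with i_mmap_rwsem already held */
        if (!hugetlb_vma_trylock_write(vma)) {
                /*
                 * The vma lock nests outside i_mmap_rwsem, which we
                 * already hold here, so we must not block on it.
                 * Skip this vma; the page stays mapped and callers
                 * already handle that failure.
                 */
                page_vma_mapped_walk_done(&pvmw);
                ret = false;
                break;
        }

        /* ... unmap the page, possibly calling huge_pmd_unshare() ... */

        hugetlb_vma_unlock_write(vma);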
> >>>
> >>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >>> ---
> >>> fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
> >>> mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
> >>> mm/memory.c          |   2 +
> >>> mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
> >>> mm/userfaultfd.c     |   9 +++-
> >>> 5 files changed, 214 insertions(+), 45 deletions(-)
> >>>
> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>> index b93d131b0cb5..52d9b390389b 100644
> >>> --- a/fs/hugetlbfs/inode.c
> >>> +++ b/fs/hugetlbfs/inode.c
> >>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>                                          struct folio *folio, pgoff_t index)
> >>>  {
> >>>          struct rb_root_cached *root = &mapping->i_mmap;
> >>> +        unsigned long skipped_vm_start;
> >>> +        struct mm_struct *skipped_mm;
> >>>          struct page *page = &folio->page;
> >>>          struct vm_area_struct *vma;
> >>>          unsigned long v_start;
> >>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>          end = ((index + 1) * pages_per_huge_page(h));
> >>>
> >>>          i_mmap_lock_write(mapping);
> >>> +retry:
> >>> +        skipped_mm = NULL;
> >>>
> >>>          vma_interval_tree_foreach(vma, root, start, end - 1) {
> >>>                  v_start = vma_offset_start(vma, start);
> >>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>                  if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >>>                          continue;
> >>>
> >>> +                if (!hugetlb_vma_trylock_write(vma)) {
> >>> +                        /*
> >>> +                         * If we cannot get the vma lock, we need to drop
> >>> +                         * i_mmap_sema and take locks in order.
> >>> +                         */
> >>> +                        skipped_vm_start = vma->vm_start;
> >>> +                        skipped_mm = vma->vm_mm;
> >>> +                        /* grab mm-struct as we will be dropping i_mmap_sema */
> >>> +                        mmgrab(skipped_mm);
> >>> +                        break;
> >>> +                }
> >>> +
> >>>                  unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >>>                                  NULL, ZAP_FLAG_DROP_MARKER);
> >>> +                hugetlb_vma_unlock_write(vma);
> >>>          }
> >>>
> >>>          i_mmap_unlock_write(mapping);
> >>> +
> >>> +        if (skipped_mm) {
> >>> +                mmap_read_lock(skipped_mm);
> >>> +                vma = find_vma(skipped_mm, skipped_vm_start);
> >>> +                if (!vma || !is_vm_hugetlb_page(vma) ||
> >>> +                    vma->vm_file->f_mapping != mapping ||
> >>> +                    vma->vm_start != skipped_vm_start) {
> >>
> >> Isn't i_mmap_lock_write(mapping) missing here? The retry logic will do i_mmap_unlock_write(mapping) anyway.
> >>
> >
> > Yes, that is missing. I will add here.
> >
> >>> +                        mmap_read_unlock(skipped_mm);
> >>> +                        mmdrop(skipped_mm);
> >>> +                        goto retry;
> >>> +                }
> >>> +
> >>
> >> IMHO, the above check is not enough. Consider the scenario below:
> >>
> >> CPU 1                                   CPU 2
> >> hugetlb_unmap_file_folio                exit_mmap
> >>   mmap_read_lock(skipped_mm);             mmap_read_lock(mm);
> >>   check vma is wanted.
> >>                                           unmap_vmas
> >>   mmap_read_unlock(skipped_mm);           mmap_read_unlock
> >>                                           mmap_write_lock(mm);
> >>                                           free_pgtables
> >>                                           remove_vma
> >>                                             hugetlb_vma_lock_free
> >>   vma, hugetlb_vma_lock is still *used after free*
> >>                                           mmap_write_unlock(mm);
> >> So we should check mm->mm_users == 0 to fix the above issue. Or am I missing something?
> >
> > In the retry case, we are OK because we go back and look up the vma again. Right?
> >
> > After taking mmap_read_lock, the vma cannot go away until we mmap_read_unlock.
> > Before that, we do the following:
> >
> >>> +                hugetlb_vma_lock_write(vma);
> >>> +                i_mmap_lock_write(mapping);
> >
> > IIUC, the vma cannot go away while we hold i_mmap_lock_write. So, after this we
>
> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
> blocked on the i_mmap_rwsem of the mapping. Sorry for reporting such a nonexistent race.
>
> > can.
> >
> >>> +                mmap_read_unlock(skipped_mm);
> >>> +                mmdrop(skipped_mm);
> >
> > We continue to hold i_mmap_lock_write as we goto retry.
> >
> > I could be missing something as well. This was how I intended to keep the
> > vma valid while dropping and reacquiring locks.
>
> Thanks for your clarifying.
>

Well, that was all correct 'in theory', but not in practice. I did not take
into account the inode lock that is taken at the beginning of truncate (or
hole punch). In other code paths, we take the inode lock after mmap_lock.
So, taking mmap_lock here is not allowed.

I came up with another way to make this work. As discussed above, we need to
drop the i_mmap lock before acquiring the vma_lock. However, once we drop
i_mmap, the vma could go away. My solution is to make the 'vma_lock' a
ref-counted structure that can live on after the vma is freed. Therefore,
this code can take a reference while under i_mmap, then drop i_mmap and wait
on the vma_lock. Of course, once it acquires the vma_lock, it needs to check
and make sure the vma still exists. It may sound complicated, but I think it
is a bit simpler than the code here. A new series will be out soon.
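For illustration only, the rough shape is something like the sketch below.
Names and details are approximate (the wait-and-recheck helper in particular
is a made-up placeholder); the real code will be in the new series.

        struct hugetlb_vma_lock {
                struct kref refs;               /* lets the lock outlive the vma */
                struct rw_semaphore rw_sema;
                struct vm_area_struct *vma;     /* cleared when the vma is freed */
        };

        static void hugetlb_vma_lock_release(struct kref *kref)
        {
                struct hugetlb_vma_lock *vma_lock =
                        container_of(kref, struct hugetlb_vma_lock, refs);

                kfree(vma_lock);
        }

        /*
         * Trylock failed while under i_mmap_rwsem: take a reference so
         * the lock structure cannot be freed, drop i_mmap, sleep on the
         * vma_lock, and only then check whether the vma still exists.
         * (Hypothetical helper, for illustration only.)
         */
        static bool hugetlb_wait_for_vma_lock(struct hugetlb_vma_lock *vma_lock,
                                              struct address_space *mapping)
        {
                kref_get(&vma_lock->refs);      /* still holding i_mmap_rwsem */
                i_mmap_unlock_write(mapping);

                down_write(&vma_lock->rw_sema); /* may sleep */
                if (!vma_lock->vma) {
                        /* vma went away while we slept; caller must retry */
                        up_write(&vma_lock->rw_sema);
                        kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
                        return false;
                }
                return true;
        }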
--
Mike Kravetz