Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

From: Miaohe Lin
Date: Tue Sep 13 2022 - 22:09:14 EST


On 2022/9/14 8:50, Mike Kravetz wrote:
> On 09/13/22 10:14, Miaohe Lin wrote:
>> On 2022/9/13 7:02, Mike Kravetz wrote:
>>> On 09/05/22 11:08, Miaohe Lin wrote:
>>>> On 2022/9/3 7:07, Mike Kravetz wrote:
>>>>> On 08/30/22 10:02, Miaohe Lin wrote:
>>>>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
>>>>>>>
>>>>>>> Faulting thread                                 Unsharing thread
>>>>>>> ...                                                  ...
>>>>>>> ptep = huge_pte_offset()
>>>>>>>       or
>>>>>>> ptep = huge_pte_alloc()
>>>>>>> ...
>>>>>>>                                                 i_mmap_lock_write
>>>>>>>                                                 lock page table
>>>>>>> ptep invalid   <------------------------        huge_pmd_unshare()
>>>>>>> Could be in a previously                        unlock_page_table
>>>>>>> sharing process or worse                        i_mmap_unlock_write
>>>>>>> ...
>>>>>>>
>>>>>>> The vma_lock is used as follows:
>>>>>>> - During fault processing, the lock is acquired in read mode before
>>>>>>>   doing a page table lock and allocation (huge_pte_alloc). The lock is
>>>>>>>   held until code is finished with the page table entry (ptep).
>>>>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
>>>>>>>   called.
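>>>>>>>
>>>>>>> The intended pattern in code is roughly as follows (illustrative
>>>>>>> sketch only, not an actual hunk from this patch):
>>>>>>>
>>>>>>> 	/* fault path: hold the lock in read mode while the ptep is used */
>>>>>>> 	hugetlb_vma_lock_read(vma);
>>>>>>> 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
>>>>>>> 	/* ... ptep can not be unshared out from under us here ... */
>>>>>>> 	hugetlb_vma_unlock_read(vma);
>>>>>>>
>>>>>>> 	/* unshare path: write mode is required */
>>>>>>> 	hugetlb_vma_lock_write(vma);
>>>>>>> 	huge_pmd_unshare(mm, vma, address, ptep);
>>>>>>> 	hugetlb_vma_unlock_write(vma);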
>>>>>>>
>>>>>>> Lock ordering issues come into play when unmapping a page from all
>>>>>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
>>>>>>> vmas, and the vma lock must be held before calling unmap which will
>>>>>>> call huge_pmd_unshare. This is done today in:
>>>>>>> - try_to_migrate_one and try_to_unmap_one for page migration and memory
>>>>>>>   error handling. In these routines we 'try' to obtain the vma lock and
>>>>>>>   fail to unmap if unsuccessful (see the sketch below). Calling routines
>>>>>>>   already deal with the failure of unmapping.
>>>>>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
>>>>>>>   also tries to acquire the vma lock. If it fails, it skips the
>>>>>>>   unmapping. However, we can not have file truncation or hole punch
>>>>>>>   fail because of contention. After hugetlb_vmdelete_list, truncation
>>>>>>>   and hole punch call remove_inode_hugepages. remove_inode_hugepages
>>>>>>>   checks for mapped pages and calls hugetlb_unmap_file_folio to unmap
>>>>>>>   them. hugetlb_unmap_file_folio is designed to drop locks and reacquire
>>>>>>>   them in the correct order to guarantee unmap success.
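>>>>>>>
>>>>>>> The rmap side does roughly the following (sketch only, not the exact
>>>>>>> hunk from mm/rmap.c):
>>>>>>>
>>>>>>> 	/* in try_to_unmap_one()/try_to_migrate_one() */
>>>>>>> 	if (!hugetlb_vma_trylock_write(vma)) {
>>>>>>> 		page_vma_mapped_walk_done(&pvmw);
>>>>>>> 		ret = false;	/* caller sees the page as still mapped */
>>>>>>> 		break;
>>>>>>> 	}
>>>>>>> 	/* ... unmap; huge_pmd_unshare() may now be called safely ... */
>>>>>>> 	hugetlb_vma_unlock_write(vma);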
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  fs/hugetlbfs/inode.c |  46 +++++++++++++++++++
>>>>>>>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
>>>>>>>  mm/memory.c          |   2 +
>>>>>>>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
>>>>>>>  mm/userfaultfd.c     |   9 +++-
>>>>>>>  5 files changed, 214 insertions(+), 45 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>>> index b93d131b0cb5..52d9b390389b 100644
>>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  					struct folio *folio, pgoff_t index)
>>>>>>>  {
>>>>>>>  	struct rb_root_cached *root = &mapping->i_mmap;
>>>>>>> +	unsigned long skipped_vm_start;
>>>>>>> +	struct mm_struct *skipped_mm;
>>>>>>>  	struct page *page = &folio->page;
>>>>>>>  	struct vm_area_struct *vma;
>>>>>>>  	unsigned long v_start;
>>>>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  	end = ((index + 1) * pages_per_huge_page(h));
>>>>>>>
>>>>>>>  	i_mmap_lock_write(mapping);
>>>>>>> +retry:
>>>>>>> +	skipped_mm = NULL;
>>>>>>>
>>>>>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>>>>>  		v_start = vma_offset_start(vma, start);
>>>>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>>>>>  			continue;
>>>>>>>
>>>>>>> +		if (!hugetlb_vma_trylock_write(vma)) {
>>>>>>> +			/*
>>>>>>> +			 * If we can not get vma lock, we need to drop
>>>>>>> +			 * i_mmap_sema and take locks in order.
>>>>>>> +			 */
>>>>>>> +			skipped_vm_start = vma->vm_start;
>>>>>>> +			skipped_mm = vma->vm_mm;
>>>>>>> +			/* grab mm-struct as we will be dropping i_mmap_sema */
>>>>>>> +			mmgrab(skipped_mm);
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +
>>>>>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>>>>>  				     NULL, ZAP_FLAG_DROP_MARKER);
>>>>>>> +		hugetlb_vma_unlock_write(vma);
>>>>>>>  	}
>>>>>>>
>>>>>>>  	i_mmap_unlock_write(mapping);
>>>>>>> +
>>>>>>> +	if (skipped_mm) {
>>>>>>> +		mmap_read_lock(skipped_mm);
>>>>>>> +		vma = find_vma(skipped_mm, skipped_vm_start);
>>>>>>> +		if (!vma || !is_vm_hugetlb_page(vma) ||
>>>>>>> +		    vma->vm_file->f_mapping != mapping ||
>>>>>>> +		    vma->vm_start != skipped_vm_start) {
>>>>>>
>>>>>> Isn't i_mmap_lock_write(mapping) missing here? The retry logic will do i_mmap_unlock_write(mapping) anyway.
>>>>>>
>>>>>
>>>>> Yes, that is missing. I will add here.
>>>>>
>>>>>>> +			mmap_read_unlock(skipped_mm);
>>>>>>> +			mmdrop(skipped_mm);
>>>>>>> +			goto retry;
>>>>>>> +		}
>>>>>>> +
>>>>>>
>>>>>> IMHO, the above check is not enough. Think about the scenario below:
>>>>>>
>>>>>> CPU 1					CPU 2
>>>>>> hugetlb_unmap_file_folio		exit_mmap
>>>>>>   mmap_read_lock(skipped_mm);	  mmap_read_lock(mm);
>>>>>>   check vma is wanted.
>>>>>> 					  unmap_vmas
>>>>>>   mmap_read_unlock(skipped_mm);	  mmap_read_unlock
>>>>>> 					  mmap_write_lock(mm);
>>>>>> 					  free_pgtables
>>>>>> 					  remove_vma
>>>>>> 					    hugetlb_vma_lock_free
>>>>>>   vma, hugetlb_vma_lock is still *used after free*
>>>>>> 					  mmap_write_unlock(mm);
>>>>>>
>>>>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I missing something?
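>>>>>>
>>>>>> E.g. something like the below (just a sketch): take the mm reference
>>>>>> with mmget_not_zero() instead of mmgrab(), so that the mm_users check
>>>>>> and the reference are taken atomically (and use mmput() instead of
>>>>>> mmdrop() when done):
>>>>>>
>>>>>> 	skipped_mm = vma->vm_mm;
>>>>>> 	/* fails once mm_users == 0, i.e. exit_mmap() may already be running */
>>>>>> 	if (!mmget_not_zero(skipped_mm))
>>>>>> 		continue;	/* the exiting mm will unmap for us */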
>>>>>
>>>>> In the retry case, we are OK because we go back and look up the vma again. Right?
>>>>>
>>>>> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
>>>>> Before that, we do the following:
>>>>>
>>>>>>> +		hugetlb_vma_lock_write(vma);
>>>>>>> +		i_mmap_lock_write(mapping);
>>>>>
>>>>> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
>>>>
>>>> I think you're right. free_pgtables() can't complete its work as
>>>> unlink_file_vma() will be blocked on the i_mmap_rwsem of mapping. Sorry
>>>> for reporting such a nonexistent race.
>>>>
>>>>> can.
>>>>>
>>>>>>> +		mmap_read_unlock(skipped_mm);
>>>>>>> +		mmdrop(skipped_mm);
>>>>>
>>>>> We continue to hold i_mmap_lock_write as we goto retry.
>>>>>
>>>>> I could be missing something as well. This was how I intended to keep
>>>>> vma valid while dropping and acquiring locks.
>>>>
>>>> Thanks for your clarifying.
>>>>
>>>
>>> Well, that was all correct 'in theory' but not in practice. I did not take
>>> into account the inode lock that is taken at the beginning of truncate (or
>>> hole punch). In other code paths, we take inode lock after mmap_lock. So,
>>> taking mmap_lock here is not allowed.
>>
>> Considering the Lock ordering in mm/filemap.c:
>>
>>  * ->i_rwsem
>>  *   ->invalidate_lock		(acquired by fs in truncate path)
>>  *     ->i_mmap_rwsem		(truncate->unmap_mapping_range)
>>  *
>>  * ->i_rwsem			(generic_perform_write)
>>  *   ->mmap_lock		(fault_in_readable->do_page_fault)
>>
>> It seems inode_lock is taken before the mmap_lock?
>
> Hmmmm? I can't find a sequence where inode_lock is taken after mmap_lock.
> lockdep was complaining about taking mmap_lock after i_rwsem in the above code.
> I assumed there was such a sequence somewhere. Might need to go back and get
> another trace/warning.

Sorry, I'm somewhat confused. Take generic_file_write_iter() as an example:

generic_file_write_iter
  inode_lock(inode);			-- *inode lock is held here*
  __generic_file_write_iter
    generic_perform_write
      fault_in_iov_iter_readable	-- *may cause a page fault and thus take mmap_lock*
  inode_unlock(inode);

This is the documented example in mm/filemap.c, so we should take the
inode_lock before taking the mmap_lock. Or is this out-dated? Does the
above example need a fix?

>
> In any case, I think the scheme below is much cleaner. Doing another round of
> benchmarking before sending.

That should be a good alternative. Thanks for your work. :)

Thanks,
Miaohe Lin

>
>>> I came up with another way to make this work. As discussed above, we need to
>>> drop the i_mmap lock before acquiring the vma_lock. However, once we drop
>>> i_mmap, the vma could go away. My solution is to make the 'vma_lock' a
>>> ref-counted structure that can live on after the vma is freed. Therefore,
>>> this code can take a reference while under i_mmap, then drop i_mmap and wait
>>> on the vma_lock. Of course, once it acquires the vma_lock it needs to check
>>> and make sure the vma still exists. It may sound complicated, but I think
>>> it is a bit simpler than the code here. A new series will be out soon.
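>>>
>>> Roughly, the idea looks like this (just a sketch, the new series may
>>> differ in names and details):
>>>
>>> 	struct hugetlb_vma_lock {
>>> 		struct kref refs;		/* lock can outlive the vma */
>>> 		struct rw_semaphore rw_sema;
>>> 		struct vm_area_struct *vma;	/* cleared when the vma is freed */
>>> 	};
>>>
>>> 	/* e.g. in hugetlb_unmap_file_folio(), while holding i_mmap_rwsem: */
>>> 	vma_lock = vma->vm_private_data;
>>> 	kref_get(&vma_lock->refs);	/* keep the lock alive across unlock */
>>> 	i_mmap_unlock_write(mapping);
>>>
>>> 	down_write(&vma_lock->rw_sema);	/* now take locks in the right order */
>>> 	i_mmap_lock_write(mapping);
>>> 	/* recheck that the vma behind vma_lock still exists before unmapping */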
>>>
>