Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing

From: Waiman Long
Date: Tue Nov 12 2019 - 12:27:51 EST


On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>> held just in read mode. It is OK to do increments in parallel without
>>>> synchronization. However, we don't want anyone else changing the count
>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>> taking the semaphore in write mode.
>>> This would be a nice addition to the changelog methinks.
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>> is in try_to_unmap_one. Yes, there is a potential race here today.
> Actually there is no race there today. Callers to huge_pmd_unshare
> hold the page table lock. So, this synchronizes those unshare calls
> from page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race. This change
>> now allows migration or poisoning to race with page fault. I would
>> really prefer if we do not open up the race window in this manner.
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
> combine this with Longman's patch. Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
> call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
> mm/hugetlb.c | 9 ++++++++-
> mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
> mm/migrate.c | 27 +++++++++++++++++++++++++--
> 3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> * indicated by page_count > 1, unmap is achieved by clearing pud and
> * decrementing the ref count. If count == 1, the pte page is not shared.
> *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode. It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping. This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is huge_pte_lock(), right? In
huge_pmd_share(), the pte lock has to be taken before one can share it.
So would you mind explaining exactly where the race is?

Thanks,
Longman

> *
> * returns: 1 successfully unmapped a shared pte page
> * 0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
> if (kill)
> collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
>
> - unmap_success = try_to_unmap(hpage, ttu);
> + if (!PageHuge(hpage)) {
> + unmap_success = try_to_unmap(hpage, ttu);
> + } else {
> + mapping = page_mapping(hpage);
> + if (mapping) {
> + /*
> + * For hugetlb pages, try_to_unmap could potentially
> + * call huge_pmd_unshare. Because of this, take
> + * semaphore in write mode here and set TTU_RMAP_LOCKED
> + * to indicate we have taken the lock at this higher
> + * level.
> + */
> + i_mmap_lock_write(mapping);
> + unmap_success = try_to_unmap(hpage,
> + ttu|TTU_RMAP_LOCKED);
> + i_mmap_unlock_write(mapping);
> + } else {
> + /*
> + * !mapping implies a MAP_PRIVATE huge page mapping.
> + * Since PMDs will never be shared in a private
> + * mapping, it is safe to let huge_pmd_unshare be
> + * called with the semaphore in read mode.
> + */
> + unmap_success = try_to_unmap(hpage, ttu);
> + }
> + }
> +
> if (!unmap_success)
> pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
> pfn, page_mapcount(hpage));
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> goto put_anon;
>
> if (page_mapped(hpage)) {
> - try_to_unmap(hpage,
> - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> + struct address_space *mapping = page_mapping(hpage);
> +
> + if (mapping) {
> + /*
> + * try_to_unmap could potentially call huge_pmd_unshare.
> + * Because of this, take semaphore in write mode here
> + * and set TTU_RMAP_LOCKED to indicate we have taken
> + * the lock at this higher level.
> + */
> + i_mmap_lock_write(mapping);
> + try_to_unmap(hpage,
> + TTU_MIGRATION|TTU_IGNORE_MLOCK|
> + TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> + i_mmap_unlock_write(mapping);
> + } else {
> + /*
> + * !mapping implies a MAP_PRIVATE huge page mapping.
> + * Since PMDs will never be shared in a private
> + * mapping, it is safe to let huge_pmd_unshare be
> + * called with the semaphore in read mode.
> + */
> + try_to_unmap(hpage,
> + TTU_MIGRATION|TTU_IGNORE_MLOCK|
> + TTU_IGNORE_ACCESS);
> + }
> page_was_mapped = 1;
> }
>