Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

From: Vinayak Menon
Date: Tue Nov 06 2018 - 04:29:04 EST


On 11/5/2018 11:52 PM, Laurent Dufour wrote:
> On 05/11/2018 at 08:04, vinayak menon wrote:
>> Hi Laurent,
>>
>> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
>> <ldufour@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> The VMA sequence count has been introduced to allow fast detection of
>>> VMA modification when running a page fault handler without holding
>>> the mmap_sem.
>>>
>>> This patch provides protection against the VMA modification done in :
>>>          - madvise()
>>>          - mpol_rebind_policy()
>>>          - vma_replace_policy()
>>>          - change_prot_numa()
>>>          - mlock(), munlock()
>>>          - mprotect()
>>>          - mmap_region()
>>>          - collapse_huge_page()
>>>          - userfaultfd registering services
>>>
>>> In addition, VMA fields which will be read during the speculative fault
>>> path need to be written using WRITE_ONCE, to prevent writes from being
>>> split and intermediate values from being pushed to other CPUs.
>>>
>>> Signed-off-by: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
>>> ---
>>>  fs/proc/task_mmu.c |  5 ++++-
>>>  fs/userfaultfd.c   | 17 +++++++++++++----
>>>  mm/khugepaged.c    |  3 +++
>>>  mm/madvise.c       |  6 +++++-
>>>  mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>>>  mm/mlock.c         | 13 ++++++++-----
>>>  mm/mmap.c          | 22 +++++++++++++---------
>>>  mm/mprotect.c      |  4 +++-
>>>  mm/swap_state.c    |  8 ++++++--
>>>  9 files changed, 89 insertions(+), 40 deletions(-)
>>>
>>>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>>                                      struct vm_fault *vmf)
>>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>>                                       unsigned long *start,
>>>                                       unsigned long *end)
>>>  {
>>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>>                       PFN_DOWN(faddr & PMD_MASK));
>>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>>                     PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>>  }
>>>
>>> --
>>> 2.7.4
>>>
>>
>> I have hit a crash on a 4.14 kernel with speculative page faults enabled,
>> and here is my analysis of the problem.
>> The issue has been reported only once.
>
> Hi Vinayak,
>
> Thanks for reporting this.
>
>>
>> [23409.303395]  el1_da+0x24/0x84
>> [23409.303400]  __radix_tree_lookup+0x8/0x90
>> [23409.303407]  find_get_entry+0x64/0x14c
>> [23409.303410]  pagecache_get_page+0x5c/0x27c
>> [23409.303416]  __read_swap_cache_async+0x80/0x260
>> [23409.303420]  swap_vma_readahead+0x264/0x37c
>> [23409.303423]  swapin_readahead+0x5c/0x6c
>> [23409.303428]  do_swap_page+0x128/0x6e4
>> [23409.303431]  handle_pte_fault+0x230/0xca4
>> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
>> [23409.303438]  do_page_fault+0x228/0x3e8
>> [23409.303442]  do_translation_fault+0x50/0x6c
>> [23409.303445]  do_mem_abort+0x5c/0xe0
>> [23409.303447]  el0_da+0x20/0x24
>>
>> Process A accesses address ADDR (part of VMA A) and that results in a
>> translation fault.
>> Kernel enters __handle_speculative_fault to fix the fault.
>> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
>> from speculative path.
>> During this time, another process B, which shares the same mm, does an
>> mprotect from another CPU; this follows
>> mprotect_fixup->__split_vma and splits VMA A into VMAs A and B.
>> After the split, ADDR falls into VMA B, but process A is still using
>> VMA A.
>> Now ADDR is greater than both VMA_A->vm_start and VMA_A->vm_end.
>> swap_vma_readahead->swap_ra_info uses the start and end of the vma to
>> calculate ptes and nr_pte, which goes wrong because of this, finally
>> resulting in a wrong "entry" being passed to
>> swap_vma_readahead->__read_swap_cache_async, which in turn passes an
>> invalid swapper_space to __read_swap_cache_async->find_get_page,
>> causing the abort.
>>
>> The fix I have tried is to also cache vm_start and vm_end in vmf and
>> use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
>> send the patch I am using if you feel that is the right thing to do.
>
> I think the best would be to not do swap readahead during the speculative page fault. If the page is found in the swap cache, that's fine, but otherwise we should fall back to the regular page fault.
>
> The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series.
>

Thanks Laurent. I am going to try this patch.
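
Just to check that I read the idea correctly, it is roughly the following,
right? (my own minimal, untested sketch of a do_swap_page() fragment, not the
attached patch, and with signatures approximated for this tree):

        /*
         * A page already in the swap cache can be used as-is, but on the
         * speculative path never start swapin/readahead; instead return
         * VM_FAULT_RETRY so the fault is retried the regular way with
         * mmap_sem held (the real code would unwind via the usual labels).
         */
        page = lookup_swap_cache(entry, vma, vmf->address);
        if (!page) {
                if (vmf->flags & FAULT_FLAG_SPECULATIVE)
                        return VM_FAULT_RETRY;
                page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
                                        vma, vmf->address);
        }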

With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in a non-speculative fault
and a retry, wouldn't this have an impact on some perf numbers? If so, would caching start
and end (roughly as sketched below) be a better option?
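
For the caching option, what I have in mind is roughly this (sketch only, with
hypothetical vmf->vm_start/vmf->vm_end fields filled in once when the
speculative fault takes its snapshot under the VMA sequence count check):

        /* in the speculative fault setup, under the VMA sequence count
         * check, snapshot the bounds into the (hypothetical) vmf fields */
        vmf->vm_start = READ_ONCE(vma->vm_start);
        vmf->vm_end = READ_ONCE(vma->vm_end);

/* and clamp against the snapshot instead of the live VMA, so a concurrent
 * __split_vma() cannot move vm_start/vm_end under us */
static inline void swap_ra_clamp_pfn(struct vm_fault *vmf,
                                     unsigned long faddr, unsigned long lpfn,
                                     unsigned long rpfn, unsigned long *start,
                                     unsigned long *end)
{
        *start = max3(lpfn, PFN_DOWN(vmf->vm_start),
                      PFN_DOWN(faddr & PMD_MASK));
        *end = min3(rpfn, PFN_DOWN(vmf->vm_end),
                    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}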

Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside swapin_readahead,
so that swap_cluster_readahead can still take the speculative path? swap_cluster_readahead
doesn't seem to use the vma values.
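
Something like the below is what I meant (sketch only; swap_use_vma_readahead()
and the vmf-based signatures are as I read them in this tree):

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                              struct vm_fault *vmf)
{
        if (swap_use_vma_readahead()) {
                /*
                 * The VMA-based readahead reads vm_start/vm_end, so bail
                 * out on the speculative path; do_swap_page() would then
                 * turn the NULL return into VM_FAULT_RETRY when
                 * FAULT_FLAG_SPECULATIVE is set.
                 */
                if (vmf->flags & FAULT_FLAG_SPECULATIVE)
                        return NULL;
                return swap_vma_readahead(entry, gfp_mask, vmf);
        }

        /*
         * Cluster readahead only uses the swap entry, not the VMA bounds,
         * so it could still run on the speculative path.
         */
        return swap_cluster_readahead(entry, gfp_mask, vmf);
}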

Thanks,
Vinayak