Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

From: zhangliang (AG)
Date: Sun Jan 16 2022 - 21:11:35 EST


Hi, thank you very much for your patient reply, and I completely agree with the viewpoint from you and Linus about COW and reuse_swap_page(). :)

On 2022/1/14 19:23, David Hildenbrand wrote:
> On 14.01.22 06:00, zhangliang (AG) wrote:
>>
>>
>> On 2022/1/14 0:48, David Hildenbrand wrote:
>>> On 13.01.22 17:37, Linus Torvalds wrote:
>>>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>>>> mapcount problems that we've been discussing recently.
>>>>>
>>>>> How about this as an alternative?
>>>>
>>>> No, at that point reuse_swap_page() is the better thing to do.
>>>>
>>>> Don't play games with page_count() (or even worse games with
>>>> swap_count). The page count is only stable if it's 1. Any other value
>>>> means that it can fluctuate due to concurrent lookups, some of which
>>>> can be done locklessly under RCU.
>>>
>>> I'm pretty sure the patch as is will reintroduce the CVE. So I think in
>>
>> Actually, I wonder how reuse_swap_page() in this patch can reintroduce CVE,
>> I think the invoking logic here is as same as that in do_swap_page().
>> So, could you give me some hint about this? Thanks :)
>
> The CVE essentially is
>
> a) Process A maps an anon page
> -> refcount = 1, mapcount = 1
> b) Process A forks process B
> -> Page mapped R/O
> -> refcount = 2, mapcount = 2
> c) Process B pins the page R/O (e.g., vmsplice)
> -> refcount > 2, mapcount = 2
> d) Process B unmaps the page
> -> refcount > 1, mapcount = 1
> e) Process A writes to the page
>
> If e) ends up reusing the page for process A (e.g., because mapcount == 1),
> process B can observe the memory modifications via the
> page pinning
>
> As I previously said, our COW logic is completely inconsistent
> and should be streamlined. Regarding the CVE, what would have
> to happen is that we put the page into the swapcache and unmap
> from both processes after c). Then, do_swap_page() would reuse
> the page during e) similarly.
>
> We do have a check in place that protects against pinned pages
> not being able to enter the swapcache, but IIRC it can race with
> GUP, so it might race with c) eventually, meaning that with swap
> there might still be a small chance for the CVE in current upstream IIRC.
>
> I think with your patch we could trick do_wp_page() do go down the
> reuse_swap_page() path and reuse the page, even though there is a pending
> pin on the page from process B.
>
>
> a) Process A maps an anon page
> -> refcount = 1, mapcount = 1
> b) Page is added to the swapcache and unmapped
> -> refcount = 1, mapcount = 0
> c) Process A reads page
> -> do_swap_page() will map the page R/O
> -> the page might remain in the swapcache
> -> refcount = 2, mapcount = 1
> d) Process A forks process B
> -> Page mapped R/O
> -> refcount = 3, mapcount = 2
> e) Process B pins the page R/O (e.g., vmsplice)
> -> refcount > 3, mapcount = 2
> d) Process B unmaps the page
> -> refcount > 2, mapcount = 1
> e) Process A writes to the page
>
> In e), we would go into reuse_swap_page() and could succeed, being left with
> refcount > 1 and mapcount = 1.
>
> But maybe I am missing something :)
>
>
>
> There are only two real ways I know to handle the scenario above:
> (1) Rely on page_count() == 1 in COW logic to decide whether we can
> reuse a page. This comes with false negatives, some being harmful,
> some just being a performance problem.
> (2) Rely on mapcount(), swapcount() ... when deciding to reuse a page
> and when it's safe to take a GUP pin.
>
> (2) was discussed in
> https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx
> but was essentially NAKed by Linus in the current form (well, it didn't
> consider the swapcount so it was just buggy, fortunately Linus pointed that
> out).
>
> To handle problematic false negatives with (1) I'm looking into marking
> anon pages as exclusively owned by a single process, such that we can simply
> reuse them in the COW path -- PageAnonExclusive() also discussed inside that
> posting. That posting also contains a test case which tests all different
> variants of the "child can observe MAP_PRIVAT modifications of the parent"
> we know of (e.g., THP, hugetlb), which you might use to verify if the issue is
> reintroduced by your patch or not.
>
>
> Yesterday, I played with removing reuse_swap_page() completely
> and streamlining our COW logic like so:
>
> (based on https://www.spinics.net/lists/linux-mm/msg279460.html)
>
>
> commit f1fa31a40b13ade5cd93ceb40daadf1196526145
> Author: David Hildenbrand <david@xxxxxxxxxx>
> Date: Fri Jan 14 09:29:52 2022 +0100
>
> mm: optimize do_wp_page() for exclusive pages in the swapcache
>
> Let's optimize for a page with a single user that has been added to the
> swapcache. Try removing the swapcache reference if there is hope that
> we're the exclusive user, but keep the page_count(page) == 1 check in
> place.
>
> Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
> users next.
>
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e8e2144cbfa6..bd2af7a36791 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3290,12 +3290,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (PageAnon(vmf->page)) {
> struct page *page = vmf->page;
>
> - /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + /*
> + * PageKsm() doesn't necessarily raise the page refcount.
> + *
> + * These checks are racy as long as we haven't locked the page;
> + * they are a pure optimization to avoid trying to lock the page
> + * and trying to free the swap cache when there is little hope
> + * it will actually result in a refcount of 1.
> + */
> + if (PageKsm(page) || page_count(page) <= 1 + PageSwapCache(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageSwapCache(page))
> + try_to_free_swap(page);
> + if (PageKsm(page) || page_count(page) != 1) {
> unlock_page(page);
> goto copy;
> }
>
>
>
> Followed by
>
>
>
> commit 754aa8f04f7b6d1ffc2b9844be63bbf7b527929f (HEAD)
> Author: David Hildenbrand <david@xxxxxxxxxx>
> Date: Fri Jan 14 09:53:20 2022 +0100
>
> mm: streamline COW logic in do_swap_page()
>
> Let's apply the same COW logic as in do_wp_page(), conditionally trying to
> remove the page from the swapcache after freeing the swap entry, however,
> before actually mapping our page. We can change the order now that
> reuse_swap_page() is no longer used.
>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bd2af7a36791..a1bd2b5c818a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3489,6 +3489,24 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> +static inline bool should_try_to_free_swap(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned int fault_flags)
> +{
> + if (!PageSwapCache(page))
> + return false;
> + if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
> + PageMlocked(page))
> + return true;
> + /*
> + * We must only reuse the page if the page_count() is 1. Try to
> + * free the swapcache to get rid of the swapcache reference in case
> + * it's likely that we will succeed.
> + */
> + return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
> + page_count(page) == 2;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3612,7 +3630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /*
> - * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
> + * Make sure try_to_free_swap or swapoff did not
> * release the swapcache from under us. The page pin, and pte_same
> * test below, are not enough to exclude that. Even if it is still
> * swapcache, we need to check that the page's swap has not changed.
> @@ -3644,19 +3662,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /*
> - * The page isn't present yet, go ahead with the fault.
> - *
> - * Be careful about the sequence of operations here.
> - * To get its accounting right, reuse_swap_page() must be called
> - * while the page is counted on swap but not yet in mapcount i.e.
> - * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> - * must be called after the swap_free(), or it will never succeed.
> + * Remove the swap entry and conditionally try to free up the swapcache.
> + * We're already holding a reference on the page but haven't mapped it
> + * yet. We won't be waiting on writeback, so if there is concurrent
> + * writeback we won't map the page writable just now.
> */
> + swap_free(entry);
> + if (should_try_to_free_swap(page, vma, vmf->flags))
> + try_to_free_swap(page);
>
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> pte = mk_pte(page, vma->vm_page_prot);
> - if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> +
> + /* Same logic as in do_wp_page(). */
> + if ((vmf->flags & FAULT_FLAG_WRITE) &&
> + !PageKsm(page) && page_count(page) == 1) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> ret |= VM_FAULT_WRITE;
> @@ -3681,10 +3702,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> }
>
> - swap_free(entry);
> - if (mem_cgroup_swap_full(page) ||
> - (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> - try_to_free_swap(page);
> unlock_page(page);
> if (page != swapcache && swapcache) {
> /*
>
>
>
>
> And followed by more patches that rip out reuse_swap_page() completely.
>
>
> But I still have to understand what's safe to do and not do in do_swap_page(),
> and which effect it would have. Especially PTE-mapped THP are a head-scratcher
> for me, and the interaction with THP in the swapcache before/after actual swap.
>
> So consider above a WIP prototype that's mostly untested.
>

--
Best Regards,
Liang Zhang