Re: [PATCH RFC 00/13] mm: COW fixes part 2: reliable GUP pins of anonymous pages

From: David Hildenbrand
Date: Tue Mar 01 2022 - 03:24:53 EST


On 24.02.22 13:26, David Hildenbrand wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on [1], which resides in -mm and -next:
> [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
> THP and swap
>
> I keep the latest state, including some hacky selftest on:
> https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2
>
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the page
> to then replacing the anonymous page by a copy in the page table: The
> GUP pin lost synchronicity with the pages mapped into the page tables.
>
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
> 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
>
> page_maybe_dma_pinned() is used to check if a page may be pinned for
> DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
> tolerable, false negatives are problematic: pages that are pinned for
> DMA must not be added to the swapcache. If it happens, the (now pinned)
> page could be faulted back from the swapcache into page tables
> read-only. Future write-access would detect the pinning and COW the
> page, losing synchronicity. For the interested reader, this is nicely
> documented in feb889fb40fa ("mm: don't put pinned pages into the swap
> cache").
>
> Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
> cases and can result in a violation of the documented semantics:
> giving false negatives because of the race.
>
> There are cases where we call it without properly taking a per-process
> sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
> one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
> handle, there is especially one rmap case (shrink_page_list) that's hard
> to fix: in the rmap world, we're not limited to a single process.
>
> The shrink_page_list() issue is really subtle. If we race with
> someone pinning a page, we can trigger the same issue as in the FOLL_GET
> case. See the detail section at the end of this mail on a discussion how
> bad this can bite us with VFIO or other FOLL_PIN user.
>
> It's harder to reproduce, but I managed to modify the O_DIRECT
> reproducer to use io_uring fixed buffers [15] instead, which ends up
> using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
> similarly trigger a loss of synchronicity and consequently a memory
> corruption.
>
> Again, the root issue is that a write-fault on a page that has
> additional references results in a COW and thereby a loss of
> synchronicity and consequently a memory corruption if two parties
> believe they are referencing the same page.
> "
>
> This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
> especially also taking care of concurrent pinning via GUP-fast,
> for example, also fully fixing an issue reported regarding NUMA
> balancing [4] recently. While doing that, it further reduces "unnecessary
> COWs", especially when we don't fork()/KSM and don't swapout, and fixes the
> COW security for hugetlb for FOLL_PIN.
>
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault handler.
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have
> to mark a mapped anonymous page shared -- which will fail if there are
> GUP pins on the page. GUP is only allowed to take a pin on anonymous pages
> that is exclusive. The PT lock is the primary mechanism to synchronize
> modifications of PG_anon_exclusive. GUP-fast is synchronized either via the
> src_mm->write_protect_seq or via clear/invalidate+flush of the relevant
> page table entry.
>
> Special care has to be taken about swap, migration, and THPs (whereby a
> PMD-mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most
> magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE
> logic as part of our previous approach [2], however, it's now 100% mapcount
> free and I further simplified it a bit.
>
> #1 is a fix
> #3-#7 are mostly rmap preparations for PG_anon_exclusive handling
> #8 introduces PG_anon_exclusive
> #9 uses PG_anon_exclusive and make R/W pins of anonymous pages
> reliable
> #10 is a preparation for reliable R/O pins
> #11 and #12 is reused/modified GUP-triggered unsharing for R/O GUP pins
> make R/O pins of anonymous pages reliable
> #13 adds sanity check when (un)pinning anonymous pages
>
> I'm not proud about patch #8, suggestions welcome. Patch #9 contains
> excessive explanations and the main logic for R/W pins. #11 and #12
> resemble what we proposed in the previous approach [2]. I consider the
> general approach of #13 very nice and helpful, and I remember Linus even
> envisioning something like that for finding BUGs, although we might want to
> implement the sanity checks eventually differently
>
> It passes my growing set of tests for "wrong COW" and "missed COW",
> including the ones in [30 -- I'd really appreciate some experienced eyes
> to take a close look at corner cases. Only tested on x86_64, testing with
> CONT-mapped hugetlb pages on arm64 might be interesting.
>
> Once we converted relevant users of FOLL_GET (e.g., O_DIRECT) to FOLL_PIN,
> the issue described in [3] under 2) will be fixed as well. Further, once
> that's in place we can streamline our COW logic for hugetlb to rely on
> page_count() as well and fix any possible COW security issues.

Hi,

I did excessive tests on aarch64 with CONT hugetlb pages yesterday and
didn't find any surprises, at least not in these changes. [1]

While thinking on how to get O_DIRECT fixed immediately, I realized the
following:
(1) This series turns FOLL_PIN on anonymous pages completely reliable,
for any case of GUP+fork (GUP before fork, GUP during fork, GUP
after fork in child/parent), which is pretty nice IMHO.
(2) For O_DIRECT and friends (FOLL_GET) we primarily care about fixing
short-term FOLL_WRITE *without* fork(), which are the memory
corruptions we're experiencing. Long-term FOLL_GET (meaning, even
staying reliable after fork()) already has a bad smell to it.

Especially, (2) never worked reliably with fork() involved. I came to
the conclusion that this series pretty much fixes (2) already *except*
the swapout case. I might be wrong, but I think it should be possible to
handle that as well, meaning that: a FOLL_GET|FOLL_WRITE on an anonymous
page will be reliable as long as we don't fork().

I'm planning on sending a part3 to cover that, so we don't have to wait
for the FOLL_PIN conversion to get our O_DIRECT reproducers fixed. Stay
tuned.

If there isn't any more feedback, I'll be sending out a v1 soonish.


[1]
https://lkml.kernel.org/r/811c5c8e-b3a2-85d2-049c-717f17c3a03a@xxxxxxxxxx


--
Thanks,

David / dhildenb