Re: GUP guarantees wrt to userspace mappings redesign

From: Andrea Arcangeli
Date: Mon May 02 2016 - 14:56:59 EST

Next message: Jyri Sarha: "Re: [PATCH 11/14] drm/tilcdc: use drm_crtc_send_vblank_event()"
Previous message: Toshi Kani: "[PATCH v2 2/3] ext2: Add alignment check for DAX mount"
In reply to: Oleg Nesterov: "Re: GUP guarantees wrt to userspace mappings redesign"
Next in thread: Andrea Arcangeli: "Re: [BUG] vfio device assignment regression with THP ref counting redesign"

On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> Quick look around:
>
> - I don't see any check page_count() around __replace_page() in uprobes,
> so it can easily replace pinned page.
>
> - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> take the pin between the check and establishing new pte entry.

/*
* Ok this is tricky, when get_user_pages_fast() run it doesn't
* take any lock, therefore the check that we are going to make
* with the pagecount against the mapcount is racey and
* O_DIRECT can happen right after the check.
* So we clear the pte and flush the tlb before the check
* this assure us that no O_DIRECT can happen after the check
* or in the middle of the check.
*/
entry = ptep_clear_flush_notify(vma, addr, ptep);

KSM takes care of that, or it wouldn't be safe to run KSM on memory
that is under O_DIRECT.
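
A minimal userspace sketch of that ordering follows. It is not kernel
code: model_page, model_pte, model_gup_fast(), model_unpin() and
model_write_protect() are hypothetical stand-ins, and the pte re-check
in model_gup_fast() is only loosely modelled on a lockless walk. The
TLB flush itself is not modelled; the sketch only shows that a pin
taken before the clear is caught by the count check, and that no pin
can be taken after the clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel types. */
struct model_page {
	atomic_int count;	/* plays the role of page_count() */
	int mapcount;		/* plays the role of page_mapcount() */
};

struct model_pte {
	_Atomic(struct model_page *) page;	/* NULL means "pte cleared" */
};

/* Lockless pin attempt: fails as soon as the pte is found cleared. */
bool model_gup_fast(struct model_pte *pte, struct model_page **out)
{
	struct model_page *page = atomic_load(&pte->page);

	if (!page)
		return false;	/* caller would fall back to the slow path */
	atomic_fetch_add(&page->count, 1);
	if (atomic_load(&pte->page) != page) {
		/* pte changed under us: drop the reference and fail */
		atomic_fetch_sub(&page->count, 1);
		return false;
	}
	*out = page;
	return true;
}

void model_unpin(struct model_page *page)
{
	int old = atomic_fetch_sub(&page->count, 1);

	assert(old > 0);
}

/*
 * Clear the pte first, check the count afterwards. The checker is
 * assumed to hold one reference of its own, hence the "+ 1".
 * On a mismatch the pte is restored and the operation is refused.
 */
bool model_write_protect(struct model_pte *pte)
{
	struct model_page *page = atomic_exchange(&pte->page, NULL);

	if (!page)
		return false;
	if (atomic_load(&page->count) != page->mapcount + 1) {
		atomic_store(&pte->page, page);
		return false;
	}
	return true;
}
```

With a page mapped once (mapcount 1, count 2 including the checker's
own reference), a successful model_gup_fast() makes
model_write_protect() refuse and restore the pte; after
model_unpin(), model_write_protect() succeeds and any later
model_gup_fast() fails.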

> - khugepaged: the same story as with KSM.

In __collapse_huge_page_isolate we do:

/*
* cannot use mapcount: can't collapse if there's a gup pin.
* The page must only be referenced by the scanned process
* and page swap cache.
*/
if (page_count(page) != 1 + !!PageSwapCache(page)) {
	unlock_page(page);
	result = SCAN_PAGE_COUNT;
	goto out;
}

At that point the pmd has been zapped (pmdp_collapse_flush has already
run) and, as in the KSM case, that is enough to ensure that
get_user_pages_fast can't succeed and will have to call into the slow
get_user_pages.
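
The arithmetic behind that page_count() comparison can be spelled out
with a small userspace sketch. The struct page_snapshot type and the
two helpers are hypothetical names used only for illustration; they
mirror the "1 + !!PageSwapCache(page)" expression quoted above and
nothing else.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the check looks at. */
struct page_snapshot {
	int count;		/* value page_count() would return */
	bool in_swap_cache;	/* value PageSwapCache() would return */
};

/* Accounted references: the scanned mapping, plus the swap cache if any. */
int expected_refs(const struct page_snapshot *p)
{
	return 1 + (p->in_swap_cache ? 1 : 0);
}

/* True when the count does not match, e.g. because of a gup pin. */
bool has_unexpected_ref(const struct page_snapshot *p)
{
	assert(p->count >= 0);
	return p->count != expected_refs(p);
}
```

A count of 1 without swap cache, or 2 with swap cache, passes; one
extra reference in either case makes the collapse bail out.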

These two issues are not specific to vfio and IOMMUs: this must be
correct or O_DIRECT will generate data corruption in the presence of
KSM/khugepaged. Both look fine to me.

> I don't see how we can deliver on the guarantee, especially with lockless
> GUP_fast.

By zapping the pmd_trans_huge/pte and sending IPIs if needed
(get_user_pages_fast runs with irq disabled), before checking
page_count.

With the RCU version it's the same, but instead of sending IPIs
we wait for a quiescent point, to be sure that any concurrent
get_user_pages_fast has been flushed out of the other CPUs before we
proceed to check page_count (after that, no other get_user_pages_fast
can increase the page count for this page on this "mm").

That's how the guarantee is provided against get_user_pages_fast.
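
The reason the zapper has to wait before reading page_count can be
sketched in userspace as below. Everything here is a hypothetical
model: the "walkers" counter stands in for "CPUs currently inside an
irq-disabled (or RCU) lockless walk", and count_is_stable() stands in
for "the IPIs have completed" or "a quiescent point was reached". It is
not how the kernel tracks this, only an illustration of the ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel types. */
struct model_page {
	atomic_int count;
};

struct model_mm {
	_Atomic(struct model_page *) entry;	/* NULL once zapped */
	atomic_int walkers;			/* lockless walks in flight */
};

/* Walker enters its lockless section and snapshots the entry. */
struct model_page *walk_begin(struct model_mm *mm)
{
	atomic_fetch_add(&mm->walkers, 1);
	return atomic_load(&mm->entry);	/* may be stale right away */
}

/* Walker pins whatever it saw, even if the entry was zapped meanwhile. */
void walk_pin(struct model_page *snap)
{
	if (snap)
		atomic_fetch_add(&snap->count, 1);
}

/* Walker leaves its lockless section. */
void walk_end(struct model_mm *mm)
{
	int old = atomic_fetch_sub(&mm->walkers, 1);

	assert(old > 0);
}

/* Zapper clears the entry so that new walkers see nothing to pin. */
void zap(struct model_mm *mm)
{
	atomic_store(&mm->entry, NULL);
}

/* The count may only be trusted once no walker is still in flight. */
bool count_is_stable(struct model_mm *mm)
{
	return atomic_load(&mm->walkers) == 0;
}
```

If a walker has taken its snapshot before zap() runs, the count read
immediately after zap() does not yet include the walker's pin; only
once count_is_stable() returns true does the count reflect it, and a
walker that starts after zap() gets NULL and cannot pin at all.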
