Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

From: Linus Torvalds
Date: Fri Sep 25 2020 - 16:33:37 EST


On Thu, Sep 24, 2020 at 2:30 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> > >
> > > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> > > it'll not go into maybe_dma_pinned() at all since cow==false.
> >
> > Hum that seems like a problem in this patch, we still need to do the
> > DMA pinned logic even if the pte is already write protected.
>
> Yes I agree. I'll take care of that in the next version too.

You people seem to be worrying too much about crazy use cases.

The fact is, if people do pinning, they had better be careful
afterwards. I agree that marking things MADV_DONTFORK may not be
great, and there may be apps that do it. But honestly, if people then
do mprotect() to make a VM non-writable after pinning a page for
writing (and before the IO has completed), such an app only has itself
to blame.

So I don't think this issue is even worth worrying about. At some
point, when apps do broken things, the kernel says "you broke it, you
get to keep both pieces". Not "Oh, you're doing unreasonable things,
let me help you".

This has dragged out a lot longer than I hoped it would, and I think
it's been over-complicated.

In fact, looking at this all, I'm starting to think that we don't
actually even need the mm_struct.has_pinned logic, because we can work
with something much simpler: the page mapping count.

A pinned page will have the page count increased by
GUP_PIN_COUNTING_BIAS, and my worry was that this would be ambiguous
with the traditional "fork a lot" UNIX style behavior. And that
traditional case is obviously one of the cases we very much don't want
to slow down.

But a pinned page has _another_ thing that is special about it: the
pinning action broke COW.

So I think we can simply add a

if (page_mapcount(page) != 1)
return false;

to page_maybe_dma_pinned(), and that very naturally protects against
the "is the page count perhaps elevated due to a lot of forking?"

Because pinning forces the mapcount to 1, and while it is pinned,
nothing else should possibly increase it - since the only thing that
would increase it is fork, and the whole point is that we won't be
doing that "page_dup_rmap()" for this page (which is what increases
the mapcount).

So we actually already have a very nice flag for "this page isn't
duplicated by forking".

And if we keep the existing early "ptep_set_wrprotect()", we also know
that we cannot be racing with another thread that is pinning at the
same time, because the fast-gup code won't be touching a read-only
pte.

So we'll just have to mark it writable again before we release the
page table lock, and we avoid that race too.

And honestly, since this is all getting fairly late in the rc, and it
took longer than I thought, I think we should do the GFP_ATOMIC
approach for now - not great, but since it only triggers for this case
that really should never happen anyway, I think it's probably the best
thing for 5.9, and we can improve on things later.

Linus