Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case ofvma_merge succeeding in copy_vma

From: Andrea Arcangeli
Date: Fri Nov 04 2011 - 16:55:32 EST


On Fri, Nov 04, 2011 at 12:16:03PM -0700, Hugh Dickins wrote:
> On Fri, 4 Nov 2011, Nai Xia wrote:
> > On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> > >> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > >>                */
> > >>               if (vma_start >= new_vma->vm_start &&
> > >>                   vma_start < new_vma->vm_end)
> > >> +                     /*
> > >> +                      * No need to call anon_vma_order_tail() in
> > >> +                      * this case because the same PT lock will
> > >> +                      * serialize the rmap_walk against both src
> > >> +                      * and dst vmas.
> > >> +                      */
> > >
> > > Really?  Please convince me: I just do not see what ensures that
> > > the same pt lock covers both src and dst areas in this case.
> >
> > At the first glance that rmap_walk does travel this merged VMA
> > once...
> > But, Now, Wait...., I am actually really puzzled that this case can really
> > happen at all, you see that vma_merge() does not break the validness
> > between page->index and its VMA. So if this can really happen,
> > a page->index should be valid in both areas in a same VMA.
> > It's strange to imagine that a PTE is copy inside a _same_ VMA
> > and page->index is valid at both old and new places.
>
> Yes, I think you are right, thank you for elucidating it.
>
> That was a real case when we wrote copy_vma(), when rmap was using
> pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
> on anonymous mappings too, it became dead code. With linear vm_pgoff
> matching, you cannot fit a range in two places within the same vma.
> (And even the non-linear case relies upon vm_pgoff defaults.)
>
> So we could simplify the copy_vma() interface a little now (get rid of
> that nasty **vmap): I'm not quite sure whether we ought to do that,
> but certainly Andrea's comment there should be updated (if he also
> agrees with your analysis).

The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.

I doubt it can be removed without crashing in the above case.

I thought some more about it and what I missed I think is the
anon_vma_merge in vma_adjust. What that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so rmap_walk
when starts again from scratch (after anon_vma_merge run in
vma_adjust) will find all ptes even if vma_merge succeeded before.

In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
have been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realized now).

What makes it safe is again taking both PT locks simultanously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change by the time rmap_walk started.

What I thought before was rmap_walk checking vma1 and then vma_merge
succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
new vma queued at the end of same_anon_vma), move_page_tables moves
the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
vma_merge won't be allowed to complete in the middle of rmap_walk, and
so it cannot trigger and we can safely drop the patch. It wasn't
immediate to think at the locks taken within vma_adjust sorry.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/