Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case ofvma_merge succeeding in copy_vma

From: Andrea Arcangeli
Date: Tue Nov 08 2011 - 19:26:25 EST


On Mon, Nov 07, 2011 at 04:28:08PM +0000, Mel Gorman wrote:
> Note that I didn't suddenly turn that ack into a nack although

:)

> 1) A small comment on why we need to call anon_vma_moveto_tail in the
> error path would be nice

I can add that.

> 2) It is unfortunate that we need the faulted_in_anon_vma just
> for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
> but not earth shatting

It should be optimized away at build time. It thought it was better
not to leave that path without a VM_BUG_ON. It should be a slow path
in the first place (probably we should even mark it unlikely). And
it's obscure enough that I think a check will clarify things. In the
common case (i.e. some pte faulted in) that vma_merge on self if it
succeeds, it couldn't possibly be safe because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page on two different virtual addresses. So checking for it I think is
sane. Especially given at some point it was mentioned we could
optimize away the check all together, so it's a bit of an obscure path
that the VM_BUG_ON I think will help document (and verify).

> What I said was taking the anon_vma lock may be slower but it was
> generally easier to understand. I'm happy with the new patch too
> particularly as it keeps the "ordering game" consistent for fork
> and mremap but I previously missed move_page_tables in the error
> path so was worried if there was something else I managed to miss
> particularly in light of the "Allocating a new vma, copy first and
> merge later" direction.

I liked that direction a lot. I thought with that we could stick to
the exact same behavior of fork and not need to reorder stuff. But the
error path is still in the way, and we've to undo the move in place
without tearing down the vmas. Plus it would have required to write
mode code, and the allocation path wouldn't have necessarily been
faster than a reordering if the list is not huge.

> I'm also prefectly happy with my human meat brain and do not expect
> to replace it with an aliens.

8-)

On a totally different but related topic, unmap_mapping_range_tree
walks the prio tree the same way try_to_unmap_file walks it and if
truncate can truncate "dst" before "src" then supposedly the
try_to_unmap_file could miss a migration entry copied into the "child"
ptep while fork runs too... But I think there is no risk there because
we don't establish migration ptes there, and we just unmap the
pagecache, so worst case we'll abort migration if the race trigger and
we'll retry later. But I wonder what happens if truncate runs against
fork, if truncate can drop ptes from dst before src (like mremap
comment says), we could still end up with some pte mapped to the file
in the ptes of the child, even if the pte was correctly truncated in
the parent...

Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
aren't fundamentally different in relation. If we relay on ordering
for anon pages in fork it's not adding too much mess to also relay on
ordering for mremap. If we take the i_mmap_mutex in mremap because we
can't enforce a order in the prio tree, then we need the i_mmap_mutex
in fork too (and that's missing). But nothing prevents us to use a
lock in mreamp and ordering in fork. I think the decision should be
based more on performance expectations.

So we could add the ordering to mremap (patch posted), and add the
i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
fork, and the i_mmap_lock to fork.

Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.

Keeping the anon and file cases separated it's better though, I think
the patch I posted should close the race and be ok. Pending your
requested changes. If you think taking the lock is faster it's fine
with me, but I think taking the anon_vma lock once per VMA (plus the
anon_vma_chail list walk), and reduce the per-pagetable locking
overhead is better. Ideally the anon_vma_chain lists won't be long
anyway. And if they are long and lots of processes do mremap at the
same time it should still work better. The anon_vma root lock is not
so small lock to take and better not to take it repeatedly. I also
recall Andi's patches to try to avoid doing lock/unlock in a tight
loop, if we take it and we do some work with it hold, is likely better
than bouncing it at high freq across CPUs for each pmd.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/