Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmasof a mergeable VMA

From: Linus Torvalds
Date: Mon Apr 12 2010 - 12:32:24 EST




On Mon, 12 Apr 2010, Linus Torvalds wrote:
>
> Yes, and that is supposed to be a no-no. The page is clearly associated
> with the vma in question (since we are unmapping it through that vma), but
> the vma list of 'anon_vma's doesn't actually have the one that
> 'page->mapping' points to.
>
> And that, in turn, means that we've lost sight of the 'page->mapping'
> anon_vma, and THAT in turn means that it could well have been free'd as
> being no longer referenced.
>
> And if it was free'd, it could be re-allocated as something else (after
> the RCU grace period), and that directly explains your oops.

I have a new theory. And this new theory is completely different from all
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the
wrong mapping.

Now, there is one case where we reset page->mapping _intentionally_,
namely in the COW-breaking case of having the last user
("page_move_anon_rmap"). And that looks fine, and happens under normal
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involved swapping.

So guys, here's my fairly simple theory on what happens:

- page gets allocated/mapped by process A. Let's call the anon_vma we
associate the page with 'A' to keep it easy to track.

- Process A forks, creating process B. The anon_vma in B is 'B', and has
a chain that looks like 'B' -> 'A'. Everything is fine.

- Swapping happens. The page (with mapping pointing to 'A') gets swapped
out (perhaps not to disk - it's enough to assume that it's just not
mapped any more, and lives entirely in the swap-cache)

- Process B pages it in, which goes like this:

do_swap_page ->
page = lookup_swap_cache(entry);
...
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);

And think about what happens here!

In particular, what happens is that this will now be the "first" mapping
of that page, so page_add_anon_rmap() will do

if (first)
__page_set_anon_rmap(page, vma, address);

and notice what anon_vma it will use? It will use the anon_vma for process
B!

So now page->mapping actually points to anon_vma 'B', not 'A' like it used
to.

What happens then? Trivial: process 'A' also pages it in (nothing happens,
it's not the first mapping), and then process 'B' execve's or exits or
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B
does not exist any more. This can go on forever. Forget about RCU grace
periods, forget about locking, forget anything like that. The bug is
simply that page->mapping points to an anon_vma that was correct at one
point, but was _not_ the one that was shared by all users of that possible
mapping.

The patch below is my largely mindless try at fixing this. It's untested.
I'm not entirely sure that it actually works. But it makes some amount of
conceptual sense. No?

Linus

---
mm/rmap.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
static void __page_set_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
- struct anon_vma *anon_vma = vma->anon_vma;
+ struct anon_vma_chain *avc;
+ struct anon_vma *anon_vma;
+
+ BUG_ON(!vma->anon_vma);
+
+ /*
+ * We must use the _oldest_ possible anon_vma for the page mapping!
+ *
+ * So take the last AVC chain entry in the vma, which is the deepest
+ * ancestor, and use the anon_vma from that.
+ */
+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+ anon_vma = avc->anon_vma;

- BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
page->index = linear_page_index(vma, address);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/