Re: [rfc] SLOB memory ordering issue

From: Linus Torvalds
Date: Fri Oct 17 2008 - 16:30:26 EST



Ok, finally looked at this.

There is indeed a locking bug there. "anon_vma_prepare()" optimistically
looks at vma->anon_vma without taking the &mm->page_table_lock. That's not
right.

Of course, we could just take the lock, but in this case it's probably ok
to just admit that we have a lockless algorithm. But that implies that we
need to do the right memory ordering.

And that, in turn, doesn't just imply a "smp_wmb()" - if you do memory
ordering, you need to do it on *both* sides, so now the other side needs
to also do a matching smp_rmb(). Or, in this case,
smp_read_barrier_depends(), I guess.

That, btw, is an important part of memory ordering. You can never do
ordering on just one side. A "smp_wmb()" on its own is always nonsensical.
It always needs to be paired with a "smp_rmb()" variant.
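
To make that pairing concrete, here's a minimal sketch of the two sides
(made-up "publish/consume" names and values, not the actual rmap code):

	#include <linux/spinlock.h>
	#include <asm/system.h>	/* smp_wmb() and friends, circa 2008 */

	struct obj {
		spinlock_t lock;
		int data;
	};

	struct obj *shared;	/* loaded locklessly by readers */

	void publish(struct obj *new)
	{
		spin_lock_init(&new->lock);
		new->data = 42;
		smp_wmb();	/* order the init before the pointer store */
		shared = new;
	}

	int consume(void)
	{
		struct obj *p = shared;	/* the optimistic lockless load */
		int val;

		if (!p)
			return -1;
		smp_read_barrier_depends();	/* pairs with the smp_wmb() */
		spin_lock(&p->lock);	/* safe: the init is visible now */
		val = p->data;
		spin_unlock(&p->lock);
		return val;
	}

The wmb without the read side (or vice versa) buys you exactly nothing:
it's the pairing that guarantees the reader sees an initialized object.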

Something like the appended may fix it.

But I do think we have a potential independent issue with the new page
table lookup code now that it's lock-free. We have the smp_rmb() calls in
gup_get_pte() (at least on x86), when we look things up, but we don't
actually have a lot of smp_wmb()'s there when we insert the page.

For the anonymous page case, we end up doing a

	page_add_new_anon_rmap();

before we do the set_pte_at() that actually exposes it, and that does the
whole

	page->mapping = (struct address_space *) anon_vma;
	page->index = linear_page_index(vma, address);

thing, but there is no write barrier between those and the actual write to
the page tables, so when GUP looks up the page, it can't actually depend
on page->mapping or anything else!

Now, this really isn't an issue on x86, since smp_wmb() is a no-op, and
the compiler won't be re-ordering the writes, but in general I do think
that now that we do lockless lookup of pages from the page tables, we
probably do need smp_wmb()'s there just in front of the "set_pte_at()"
calls.
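
IOW, conceptually something like this in the fault path (hand-waving
sketch with a made-up helper, not a real patch):

	#include <linux/rmap.h>		/* page_add_new_anon_rmap() */
	#include <asm/pgtable.h>	/* set_pte_at() */

	static void map_new_anon_page(struct mm_struct *mm,
			struct vm_area_struct *vma, unsigned long address,
			pte_t *ptep, pte_t entry, struct page *page)
	{
		/* sets page->mapping and page->index */
		page_add_new_anon_rmap(page, vma, address);

		/*
		 * Make the rmap fields visible before the pte that
		 * exposes the page. Pairs with the smp_rmb() that the
		 * lockless GUP walker does in gup_get_pte().
		 */
		smp_wmb();

		set_pte_at(mm, address, ptep, entry);
	}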

NOTE NOTE! The patch below is only about "page->anon_vma", not about the
GUP lookup and page->mapping/index fields. That's an independent issue.

And notice? This has _nothing_ to do with constructors or allocators.

And of course - this patch is totally untested, and may well need some
thinking about.

Linus

---
mm/rmap.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0383acf..21d09bb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -81,6 +81,13 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			/*
+			 * We hold the mm->page_table_lock, but another
+			 * CPU may be doing an optimistic load (the one
+			 * at the top), and we want to make sure that
+			 * the anon_vma changes are visible.
+			 */
+			smp_wmb();
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
@@ -92,6 +99,17 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
+	/*
+	 * Subtle: we looked up anon_vma without any locking
+	 * (in the common case), and are going to look at the
+	 * spinlock etc behind it. In order to know that it's
+	 * initialized, we need to do a read barrier here.
+	 *
+	 * We can use the cheaper "depends" version, since we
+	 * are following a pointer, and only on alpha may that
+	 * give a stale value.
+	 */
+	smp_read_barrier_depends();
 	return 0;
 }
 
