Re: [PATCH 7/11] ksm: make KSM page migration possible

From: Hugh Dickins
Date: Fri Feb 08 2013 - 15:52:09 EST


Paul, I've added you to the Cc in the hope that you can shed your light
on an smp_read_barrier_depends() question with which Mel taxes me below.
You may ask for more context: linux-next currently has an mm/ksm.c after
this patch is applied, but you may have questions beyond that - thanks!

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote:
> > KSM page migration is already supported in the case of memory hotremove,
> > which takes the ksm_thread_mutex across all its migrations to keep life
> > simple.
> >
> > But the new KSM NUMA merge_across_nodes knob introduces a problem, when
> > it's set to non-default 0: if a KSM page is migrated to a different NUMA
> > node, how do we migrate its stable node to the right tree? And what if
> > that collides with an existing stable node?
> >
> > So far there's no provision for that, and this patch does not attempt
> > to deal with it either. But how will I test a solution, when I don't
> > know how to hotremove memory?
>
> Just reach in and yank it straight out with a chisel.

:)

>
> > The best answer is to enable KSM page
> > migration in all cases now, and test more common cases. With THP and
> > compaction added since KSM came in, page migration is now mainstream,
> > and it's a shame that a KSM page can frustrate freeing a page block.
> >
>
> THP will at least check if migration within a node works. It won't
> necessarily check we can migrate across nodes properly but it's a lot
> better than nothing.

No, I went back and dug out a hack-patch I was using three or four years
ago: occasionally on fault, just migrate every possible page in that mm
for no reason other than to test page migration.

> > static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> > {
> > struct page *page;
> > void *expected_mapping;
> > + unsigned long kpfn;
> >
> > - page = pfn_to_page(stable_node->kpfn);
> > expected_mapping = (void *)stable_node +
> > (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> > - if (page->mapping != expected_mapping)
> > - goto stale;
> > - if (!get_page_unless_zero(page))
> > +again:
> > + kpfn = ACCESS_ONCE(stable_node->kpfn);
> > + page = pfn_to_page(kpfn);
> > +
>
> Ok.
>
> There should be no concern that hot-remove made the kpfn invalid because
> those stable tree entries should have been discarded.

Yes.

>
> > + /*
> > + * page is computed from kpfn, so on most architectures reading
> > + * page->mapping is naturally ordered after reading node->kpfn,
> > + * but on Alpha we need to be more careful.
> > + */
> > + smp_read_barrier_depends();
>
> The value of page is data dependant on pfn_to_page(). Is it really possible
> for that to be re-ordered even on Alpha?

My intuition (to say "understanding" would be an exaggeration) is that
on Alpha a very old value of page->mapping (in the line below) might be
lying around and read from one cache, which has not necessarily been
invalidated by ksm_migrate_page() pointing stable_node->kpfn to this
new page.

And if that happens, we could easily and mistakenly conclude that this
stable node is stale: although there's an smp_rmb() after goto stale,
stable_node->kpfn would still match kpfn, and we wrongly remove the node.

My confidence that I've expressed that clearly in words, is lower than
my confidence that I've coded it right; and if I'm wrong, yes, surely
it's better to remove any cargo-cult smp_read_barrier_depends().

>
> > + if (ACCESS_ONCE(page->mapping) != expected_mapping)
> > goto stale;
> > - if (page->mapping != expected_mapping) {
> > +
> > + /*
> > + * We cannot do anything with the page while its refcount is 0.
> > + * Usually 0 means free, or tail of a higher-order page: in which
> > + * case this node is no longer referenced, and should be freed;
> > + * however, it might mean that the page is under page_freeze_refs().
> > + * The __remove_mapping() case is easy, again the node is now stale;
> > + * but if page is swapcache in migrate_page_move_mapping(), it might
> > + * still be our page, in which case it's essential to keep the node.
> > + */
> > + while (!get_page_unless_zero(page)) {
> > + /*
> > + * Another check for page->mapping != expected_mapping would
> > + * work here too. We have chosen the !PageSwapCache test to
> > + * optimize the common case, when the page is or is about to
> > + * be freed: PageSwapCache is cleared (under spin_lock_irq)
> > + * in the freeze_refs section of __remove_mapping(); but Anon
> > + * page->mapping reset to NULL later, in free_pages_prepare().
> > + */
> > + if (!PageSwapCache(page))
> > + goto stale;
> > + cpu_relax();
> > + }
>
> The recheck of stable_node->kpfn check after a barrier distinguishes between
> a free and a completed migration, that's fine. I'm hesitate to ask because
> it must be obvious but where is the guarantee that a KSM page is in the
> swap cache?

Certainly none at all: it's the less common case that a KSM page is in
swap cache. But if it is not in swap cache, how could its page count be
0 (causing get_page_unless_zero to fail)? By being free, or well on its
way to being freed (hence stale); or reused as part of a compound page
(hence stale also); or reused for another purpose which arrives at a
page_freeze_refs() (hence stale also); other cases?

It's hard to see from the diff, but in the original version of
get_ksm_page(), !get_page_unless_zero goes straight to stale.

Don't for a moment imagine that this function sprang fully formed
from my mind: it was hard to get it working right (the swap cache
get_page_unless_zero failure during migration really caught me out),
and then to pare it down to its fairly simple final form.

Hugh

>
> > +
> > + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > put_page(page);
> > goto stale;
> > }
> > +
> > if (locked) {
> > lock_page(page);
> > - if (page->mapping != expected_mapping) {
> > + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > unlock_page(page);
> > put_page(page);
> > goto stale;
> > }
> > }
> > return page;
> > +
> > stale:
> > + /*
> > + * We come here from above when page->mapping or !PageSwapCache
> > + * suggests that the node is stale; but it might be under migration.
> > + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
> > + * before checking whether node->kpfn has been changed.
> > + */
> > + smp_rmb();
> > + if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
> > + goto again;
> > remove_node_from_stable_tree(stable_node);
> > return NULL;
> > }
> > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
> > if (stable_node) {
> > VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> > stable_node->kpfn = page_to_pfn(newpage);
> > + /*
> > + * newpage->mapping was set in advance; now we need smp_wmb()
> > + * to make sure that the new stable_node->kpfn is visible
> > + * to get_ksm_page() before it can see that oldpage->mapping
> > + * has gone stale (or that PageSwapCache has been cleared).
> > + */
> > + smp_wmb();
> > + set_page_stable_node(oldpage, NULL);
> > }
> > }
> > #endif /* CONFIG_MIGRATION */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/