Re: [PATCH v4] KSM: numa awareness sysfs knob

From: Petr Holasek
Date: Mon Oct 08 2012 - 19:00:50 EST


Hi Hugh,

first of all, please let me apologize for the delay in my response and thank
you for your extensive review!

On Sun, 30 Sep 2012, Hugh Dickins wrote:
> Andrea's point about ksm_migrate_page() is an important one, and I've
> answered that against his mail, but here's some other easier points.
>
> On Mon, 24 Sep 2012, Petr Holasek wrote:
>
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which control merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> >

...

> >
> > v4: - merge_nodes was renamed to merge_across_nodes
> > - share_all debug knob was dropped
>
> Yes, you were right to drop the share_all knob for now. I do like the
> idea, but it was quite inappropriate to mix it in with this NUMAnode
> patch. And although I like the idea, I think it wants a bit more: I
> already have a hacky "allksm" boot option patch to mm/mmap.c for my
> own testing, which adds VM_MERGEABLE everywhere. If I gave that up
> (I'd like to!) in favour of yours, I think I would still be missing
> all the mms that are created before there's a chance to switch the
> share_all mode on. Or maybe I misread your v3. Anyway, that's a
> different topic, happily taken off today's agenda.
>

Agreed, it hid original purpose of this patch and made it more difficult for
eventual merging. So let's move it lower on the ksm todo list for this time :)

> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 47c8853..7c82032 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -36,6 +36,7 @@
> > #include <linux/hash.h>
> > #include <linux/freezer.h>
> > #include <linux/oom.h>
> > +#include <linux/numa.h>
> >
> > #include <asm/tlbflush.h>
> > #include "internal.h"
> > @@ -140,7 +141,10 @@ struct rmap_item {
> > unsigned long address; /* + low bits used for flags below */
> > unsigned int oldchecksum; /* when unstable */
> > union {
> > - struct rb_node node; /* when node of unstable tree */
> > + struct {
> > + struct rb_node node; /* when node of unstable tree */
> > + struct rb_root *root;
> > + };
>
> This worries me a little, enlarging struct rmap_item: there may
> be very many of them in the system, best to minimize their size.
>
> (This struct rb_root *root is used for one thing only, isn't it? For the
> unstable rb_erase() in remove_rmap_item_from_tree(). It annoys me that
> we need to record root just for that, but I don't see an alternative.)

Yes, I've played a quite lot with this issue, but wasn't able to find an
alternative solution, too.

>
> The 64-bit case can be easily improved by locating unsigned int nid
> after oldchecksum instead. The 32-bit case (which supports smaller
> NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
> nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
> STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
> but 2 bits would be nice for keeping the BUG_ON(age > 1).
>
> But you may think I'm going too far there, and prefer just to put
> #ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
> not enlarge the more basic 32-bit configuration.
>

I like your idea of unsigned int nid, will implement it in next version.

...

> >
> > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> > - struct stable_node *stable_node;
> > + for (i = 0; i < MAX_NUMNODES; i++)
>
> It's irritating to have to do this outer nid loop, but I think you're
> right, that the memory hotremove notification does not quite tell us
> which node to look at. Or can we derive that from start_pfn? Would
> it be safe to assume that end_pfn-1 must be in the same node?
>

I had assumed that we can't rely on end_pfn-1 in the same node, but now
mm/memory_hotremove.c looks to me that we can rely on it because memory hotremove
callback is triggered for each zone where we are removing memory.
So I think yes, we can optimize it in the way you mentioned above. If I am wrong
correct me, please :)

...

>
> Looks nice - thank you.
>

Thanks for your help!

Petr
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/