Re: [PATCH v4] KSM: numa awareness sysfs knob

From: Petr Holasek
Date: Mon Oct 08 2012 - 19:00:50 EST

Next message: Paul Gortmaker: "Re: [ANNOUNCE] 3.4.12-rt20"
Previous message: Paul E. McKenney: "Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Hugh,

first of all, please let me apologize for the delay in my response and thank
you for your extensive review!

On Sun, 30 Sep 2012, Hugh Dickins wrote:
> Andrea's point about ksm_migrate_page() is an important one, and I've
> answered that against his mail, but here's some other easier points.
>
> On Mon, 24 Sep 2012, Petr Holasek wrote:
>
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which control merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> >

...

> >
> > v4: - merge_nodes was renamed to merge_across_nodes
> > - share_all debug knob was dropped
>
> Yes, you were right to drop the share_all knob for now. I do like the
> idea, but it was quite inappropriate to mix it in with this NUMAnode
> patch. And although I like the idea, I think it wants a bit more: I
> already have a hacky "allksm" boot option patch to mm/mmap.c for my
> own testing, which adds VM_MERGEABLE everywhere. If I gave that up
> (I'd like to!) in favour of yours, I think I would still be missing
> all the mms that are created before there's a chance to switch the
> share_all mode on. Or maybe I misread your v3. Anyway, that's a
> different topic, happily taken off today's agenda.
>

Agreed, it hid original purpose of this patch and made it more difficult for
eventual merging. So let's move it lower on the ksm todo list for this time :)

> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 47c8853..7c82032 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -36,6 +36,7 @@
> > #include <linux/hash.h>
> > #include <linux/freezer.h>
> > #include <linux/oom.h>
> > +#include <linux/numa.h>
> >
> > #include <asm/tlbflush.h>
> > #include "internal.h"
> > @@ -140,7 +141,10 @@ struct rmap_item {
> > unsigned long address; /* + low bits used for flags below */
> > unsigned int oldchecksum; /* when unstable */
> > union {
> > - struct rb_node node; /* when node of unstable tree */
> > + struct {
> > + struct rb_node node; /* when node of unstable tree */
> > + struct rb_root *root;
> > + };
>
> This worries me a little, enlarging struct rmap_item: there may
> be very many of them in the system, best to minimize their size.
>
> (This struct rb_root *root is used for one thing only, isn't it? For the
> unstable rb_erase() in remove_rmap_item_from_tree(). It annoys me that
> we need to record root just for that, but I don't see an alternative.)

Yes, I've played a quite lot with this issue, but wasn't able to find an
alternative solution, too.

>
> The 64-bit case can be easily improved by locating unsigned int nid
> after oldchecksum instead. The 32-bit case (which supports smaller
> NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
> nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
> STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
> but 2 bits would be nice for keeping the BUG_ON(age > 1).
>
> But you may think I'm going too far there, and prefer just to put
> #ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
> not enlarge the more basic 32-bit configuration.
>

I like your idea of unsigned int nid, will implement it in next version.

...

> >
> > - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> > - struct stable_node *stable_node;
> > + for (i = 0; i < MAX_NUMNODES; i++)
>
> It's irritating to have to do this outer nid loop, but I think you're
> right, that the memory hotremove notification does not quite tell us
> which node to look at. Or can we derive that from start_pfn? Would
> it be safe to assume that end_pfn-1 must be in the same node?
>

I had assumed that we can't rely on end_pfn-1 in the same node, but now
mm/memory_hotremove.c looks to me that we can rely on it because memory hotremove
callback is triggered for each zone where we are removing memory.
So I think yes, we can optimize it in the way you mentioned above. If I am wrong
correct me, please :)

...

>
> Looks nice - thank you.
>

Thanks for your help!

Petr
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Paul Gortmaker: "Re: [ANNOUNCE] 3.4.12-rt20"
Previous message: Paul E. McKenney: "Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]