Re: [PATCH v4] KSM: numa awareness sysfs knob

From: Hugh Dickins
Date: Sun Sep 30 2012 - 21:38:57 EST


Andrea's point about ksm_migrate_page() is an important one, and I've
answered that against his mail, but here are some other, easier points.

On Mon, 24 Sep 2012, Petr Holasek wrote:

> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use-case could be a lot of KVM guests on a NUMA machine,
> where CPUs from more distant nodes would see a significant increase
> in access latency to the merged ksm page. A sysfs knob was chosen
> for greater flexibility, since some users still prefer a higher amount
> of saved physical memory regardless of access latency.
>
> Every NUMA node has its own stable & unstable tree, for faster
> searching and inserting. Changing the merge_nodes value is possible only

merge_across_nodes

> when there are no ksm shared pages in the system.
>
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
>
> merge_nodes=1

merge_across_nodes

> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_nodes=0

merge_across_nodes

> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
>
> Changelog:
>
> v2: Andrew's objections were reflected:
> - value of merge_nodes can't be changed while there are some ksm
> pages in system
> - merge_nodes sysfs entry appearance depends on CONFIG_NUMA
> - more verbose documentation
> - added some performance testing results
>
> v3: - more verbose documentation
> - fixed race in merge_nodes store function
> - introduced share_all debugging knob proposed by Andrew
> - minor cleanups
>
> v4: - merge_nodes was renamed to merge_across_nodes
> - share_all debug knob was dropped

Yes, you were right to drop the share_all knob for now. I do like the
idea, but it was quite inappropriate to mix it in with this NUMAnode
patch. And although I like the idea, I think it wants a bit more: I
already have a hacky "allksm" boot option patch to mm/mmap.c for my
own testing, which adds VM_MERGEABLE everywhere. If I gave that up
(I'd like to!) in favour of yours, I think I would still be missing
all the mms that are created before there's a chance to switch the
share_all mode on. Or maybe I misread your v3. Anyway, that's a
different topic, happily taken off today's agenda.

> - get_kpfn_nid helper
> - fixed page migration behaviour
>
> Signed-off-by: Petr Holasek <pholasek@xxxxxxxxxx>
> ---
> Documentation/vm/ksm.txt | 7 +++
> mm/ksm.c | 135 ++++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 122 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index b392e49..100d58d 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>
> +merge_nodes - specifies if pages from different numa nodes can be merged.

merge_across_nodes

> + When set to 0, ksm merges only pages which physically
> + reside in the memory area of same NUMA node. It brings
> + lower latency to access to shared page. Value can be
> + changed only when there is no ksm shared pages in system.
> + Default: 1
> +
> run - set 0 to stop ksmd from running but keep merged pages,
> set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> set 2 to stop ksmd and unmerge all pages currently merged,
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..7c82032 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -36,6 +36,7 @@
> #include <linux/hash.h>
> #include <linux/freezer.h>
> #include <linux/oom.h>
> +#include <linux/numa.h>
>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -140,7 +141,10 @@ struct rmap_item {
> unsigned long address; /* + low bits used for flags below */
> unsigned int oldchecksum; /* when unstable */
> union {
> - struct rb_node node; /* when node of unstable tree */
> + struct {
> + struct rb_node node; /* when node of unstable tree */
> + struct rb_root *root;
> + };

This worries me a little, enlarging struct rmap_item: there may
be very many of them in the system, best to minimize their size.

(This struct rb_root *root is used for one thing only, isn't it? For the
unstable rb_erase() in remove_rmap_item_from_tree(). It annoys me that
we need to record root just for that, but I don't see an alternative.)

The 64-bit case can be easily improved by locating unsigned int nid
after oldchecksum instead. The 32-bit case (which supports smaller
NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
but 2 bits would be nice for keeping the BUG_ON(age > 1).

But you may think I'm going too far there, and prefer just to put
#ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
not enlarge the more basic 32-bit configuration.
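
To make the 64-bit suggestion concrete, here is roughly the layout I
have in mind - untested, and the fields outside your quoted hunk are
written from memory of mm/ksm.c, so treat it as a sketch rather than
a drop-in replacement:

	struct rmap_item {
		struct rmap_item *rmap_list;
		struct anon_vma *anon_vma;	/* when stable */
		struct mm_struct *mm;
		unsigned long address;		/* + low bits used for flags below */
		unsigned int oldchecksum;	/* when unstable */
	#ifdef CONFIG_NUMA
		unsigned int nid;		/* fills the alignment hole on 64-bit */
	#endif
		union {
			struct rb_node node;	/* when node of unstable tree */
			struct {		/* when listed from stable tree */
				struct stable_node *head;
				struct hlist_node hlist;
			};
		};
	};

With nid recorded there, the unstable rb_erase() could use
&root_unstable_tree[rmap_item->nid], and the extra root pointer in the
union would not be needed.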

> struct { /* when listed from stable tree */
> struct stable_node *head;
> struct hlist_node hlist;
> @@ -153,8 +157,8 @@ struct rmap_item {
> #define STABLE_FLAG 0x200 /* is listed from the stable tree */
>
> /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };
> +static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };
>
> #define MM_SLOTS_HASH_SHIFT 10
> #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
> @@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
> /* Milliseconds ksmd should sleep between batches */
> static unsigned int ksm_thread_sleep_millisecs = 20;
>
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
> #define KSM_RUN_STOP 0
> #define KSM_RUN_MERGE 1
> #define KSM_RUN_UNMERGE 2
> @@ -447,10 +454,25 @@ out: page = NULL;
> return page;
> }
>
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> + if (ksm_merge_across_nodes)
> + return 0;
> + else
> + return pfn_to_nid(kpfn);
> +}
> +
> static void remove_node_from_stable_tree(struct stable_node *stable_node)
> {
> struct rmap_item *rmap_item;
> struct hlist_node *hlist;
> + int nid;
>
> hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
> if (rmap_item->hlist.next)
> @@ -462,7 +484,10 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
> cond_resched();
> }
>
> - rb_erase(&stable_node->node, &root_stable_tree);
> + nid = get_kpfn_nid(stable_node->kpfn);
> +
> + rb_erase(&stable_node->node,
> + &root_stable_tree[nid]);

Oh, if I ever get my fingers on that, it's going to join the line above :)

> free_stable_node(stable_node);
> }
>
> @@ -560,7 +585,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
> age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> BUG_ON(age > 1);
> if (!age)
> - rb_erase(&rmap_item->node, &root_unstable_tree);
> + rb_erase(&rmap_item->node, rmap_item->root);

Right, this is where we have to use rmap_item->root
or &root_unstable_tree[rmap_item->nid].

>
> ksm_pages_unshared--;
> rmap_item->address &= PAGE_MASK;
> @@ -989,8 +1014,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
> */
> static struct page *stable_tree_search(struct page *page)
> {
> - struct rb_node *node = root_stable_tree.rb_node;
> + struct rb_node *node;
> struct stable_node *stable_node;
> + int nid;
>
> stable_node = page_stable_node(page);
> if (stable_node) { /* ksm page forked */
> @@ -998,6 +1024,9 @@ static struct page *stable_tree_search(struct page *page)
> return page;
> }
>
> + nid = get_kpfn_nid(page_to_pfn(page));
> + node = root_stable_tree[nid].rb_node;
> +
> while (node) {
> struct page *tree_page;
> int ret;
> @@ -1032,10 +1061,14 @@ static struct page *stable_tree_search(struct page *page)
> */
> static struct stable_node *stable_tree_insert(struct page *kpage)
> {
> - struct rb_node **new = &root_stable_tree.rb_node;
> + int nid;
> + struct rb_node **new = NULL;

No need to initialize new.

> struct rb_node *parent = NULL;
> struct stable_node *stable_node;
>
> + nid = get_kpfn_nid(page_to_nid(kpage));
> + new = &root_stable_tree[nid].rb_node;
> +
> while (*new) {
> struct page *tree_page;
> int ret;
> @@ -1069,7 +1102,7 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
> return NULL;
>
> rb_link_node(&stable_node->node, parent, new);
> - rb_insert_color(&stable_node->node, &root_stable_tree);
> + rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>
> INIT_HLIST_HEAD(&stable_node->hlist);
>
> @@ -1097,10 +1130,16 @@ static
> struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
> struct page *page,
> struct page **tree_pagep)
> -

Thank you!

> {
> - struct rb_node **new = &root_unstable_tree.rb_node;
> + struct rb_node **new = NULL;

No need to initialize new.

> + struct rb_root *root;
> struct rb_node *parent = NULL;
> + int nid;
> +
> + nid = get_kpfn_nid(page_to_pfn(page));
> + root = &root_unstable_tree[nid];
> +
> + new = &root->rb_node;
>
> while (*new) {
> struct rmap_item *tree_rmap_item;
> @@ -1138,8 +1177,9 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>
> rmap_item->address |= UNSTABLE_FLAG;
> rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> + rmap_item->root = root;
> rb_link_node(&rmap_item->node, parent, new);
> - rb_insert_color(&rmap_item->node, &root_unstable_tree);
> + rb_insert_color(&rmap_item->node, root);
>
> ksm_pages_unshared++;
> return NULL;
> @@ -1282,6 +1322,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> struct mm_slot *slot;
> struct vm_area_struct *vma;
> struct rmap_item *rmap_item;
> + int i;

I'd call it nid myself.

>
> if (list_empty(&ksm_mm_head.mm_list))
> return NULL;
> @@ -1300,7 +1341,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> */
> lru_add_drain_all();
>
> - root_unstable_tree = RB_ROOT;
> + for (i = 0; i < MAX_NUMNODES; i++)
> + root_unstable_tree[i] = RB_ROOT;
>
> spin_lock(&ksm_mmlist_lock);
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
> stable_node = page_stable_node(newpage);
> if (stable_node) {
> VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> - stable_node->kpfn = page_to_pfn(newpage);
> +
> + if (ksm_merge_across_nodes ||
> + page_to_nid(oldpage) == page_to_nid(newpage))
> + stable_node->kpfn = page_to_pfn(newpage);
> + else
> + remove_node_from_stable_tree(stable_node);

Important point from Andrea discussed elsewhere.

> }
> }
> #endif /* CONFIG_MIGRATION */
> @@ -1768,15 +1815,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> unsigned long end_pfn)
> {
> struct rb_node *node;
> + int i;

I'd call it nid myself.

>
> - for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> - struct stable_node *stable_node;
> + for (i = 0; i < MAX_NUMNODES; i++)

It's irritating to have to do this outer nid loop, but I think you're
right, that the memory hotremove notification does not quite tell us
which node to look at. Or can we derive that from start_pfn? Would
it be safe to assume that end_pfn-1 must be in the same node?
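
If that assumption is safe, the whole function could collapse to
something like this - just an untested sketch against your v4, on the
assumption that a hotremove range never straddles a node boundary:

	static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
							 unsigned long end_pfn)
	{
		struct rb_node *node;
		int nid = get_kpfn_nid(start_pfn);	/* assumes whole range is on one node */

		for (node = rb_first(&root_stable_tree[nid]); node;
		     node = rb_next(node)) {
			struct stable_node *stable_node;

			stable_node = rb_entry(node, struct stable_node, node);
			if (stable_node->kpfn >= start_pfn &&
			    stable_node->kpfn < end_pfn)
				return stable_node;
		}
		return NULL;
	}

get_kpfn_nid() already collapses to index 0 when merge_across_nodes is
set, so the same lookup would work in both modes.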

> + for (node = rb_first(&root_stable_tree[i]); node;
> + node = rb_next(node)) {
> + struct stable_node *stable_node;
> +
> + stable_node = rb_entry(node, struct stable_node, node);
> + if (stable_node->kpfn >= start_pfn &&
> + stable_node->kpfn < end_pfn)
> + return stable_node;
> + }
>
> - stable_node = rb_entry(node, struct stable_node, node);
> - if (stable_node->kpfn >= start_pfn &&
> - stable_node->kpfn < end_pfn)
> - return stable_node;
> - }
> return NULL;
> }
>
> @@ -1926,6 +1977,47 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> }
> KSM_ATTR(run);
>
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long knob;
> +
> + err = kstrtoul(buf, 10, &knob);
> + if (err)
> + return err;
> + if (knob > 1)
> + return -EINVAL;
> +
> + mutex_lock(&ksm_thread_mutex);
> + if (ksm_run & KSM_RUN_MERGE) {
> + err = -EBUSY;
> + } else {
> + if (ksm_merge_across_nodes != knob) {
> + if (ksm_pages_shared > 0)
> + err = -EBUSY;
> + else
> + ksm_merge_across_nodes = knob;
> + }
> + }
> +
> + if (err)
> + count = err;
> + mutex_unlock(&ksm_thread_mutex);
> +
> + return count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
> static ssize_t pages_shared_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> @@ -1980,6 +2072,9 @@ static struct attribute *ksm_attrs[] = {
> &pages_unshared_attr.attr,
> &pages_volatile_attr.attr,
> &full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> + &merge_across_nodes_attr.attr,
> +#endif
> NULL,
> };
>
> --
> 1.7.11.4

Looks nice - thank you.

Hugh