Re: [PATCH v3] KSM: numa awareness sysfs knob

From: Andrew Morton
Date: Fri Sep 14 2012 - 18:02:43 EST

Next message: Dhaval Giani: "Re: [RFC] cgroup TODOs"
Previous message: Mark Langsdorf: "Re: linux-next: Tree for Sept 14 (sata_highhbank)"
In reply to: Petr Holasek: "[PATCH v3] KSM: numa awareness sysfs knob"
Next in thread: Petr Holasek: "Re: [PATCH v3] KSM: numa awareness sysfs knob"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 14 Sep 2012 23:22:47 +0200
Petr Holasek <pholasek@xxxxxxxxxx> wrote:

> Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_nodes

I wonder if merge_across_nodes would be a better name.

> which control merging pages across different numa nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
>
> Typical use-case could be a lot of KVM guests on NUMA machine
> and cpus from more distant nodes would have significant increase
> of access latency to the merged ksm page. Sysfs knob was choosen
> for higher scalability.

Well... what is the use case for merge_nodes=0? IOW, why shouldn't we
make this change non-optional and avoid the sysfs knob?

> Every numa node has its own stable & unstable trees because
> of faster searching and inserting. Changing of merge_nodes
> value is possible only when there are not any ksm shared pages in system.
>
> This patch also adds share_all sysfs knob which can be used for adding
> all anon vmas of all processes in system to ksmd scan queue. Knob can be
> triggered only when run knob is set to zero.

I really don't understand this share_all thing. From reading the code,
it is a once-off self-resetting trigger thing. Why? How is one to use
this? What's the benefit? What's the effect?

> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times in percentage of average
> were following:
>
> merge_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_nodes=0
> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
>
> ...
>
> @@ -462,7 +473,13 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
> cond_resched();
> }
>
> - rb_erase(&stable_node->node, &root_stable_tree);
> + if (ksm_merge_nodes)
> + nid = 0;
> + else
> + nid = pfn_to_nid(stable_node->kpfn);

This sequence happens three times - it might be a little tidier to
capture the above into a separate helper function. Or not bother ;)
One benefit of the standalone function is that it provides a nice site
for a comment

> + rb_erase(&stable_node->node,
> + &root_stable_tree[nid]);
> free_stable_node(stable_node);
> }
>
>
> ...
>
> +static int ksmd_should_run(void)
> +{
> + return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
> +}
> +
> +static int ksmd_should_madvise(void)
> +{
> + return ksm_share_all;
> +}
> +
> +static int ksm_madvise_all(void)
> +{
> + struct task_struct *p;
> + int err;
> +
> + for_each_process(p) {

what, what. We can't just go waltzing across the task list without
taking any locks. Needs rcu_read_lock(), methinks.

Also... I've forgotten how threads/processes are arranged. Will this
walk across all the threads in the system? If so, that would be
terribly inefficient walking 1000 task structs which share a single mm,
manipulating that mm 1000 times. It might be better to walk the mm's
instead - see mm_struct.mmlist.

> + read_lock(&tasklist_lock);
> +
> + if (!p->mm)
> + goto out;
> +
> + down_write(&p->mm->mmap_sem);

whoa, you can't do down_write() inside read_lock().

Please, immediately put down your mail client, read
Documentation/SubmitChecklist section 12 and go make the appropriate
changes to your kernel .config.

> + err = ksm_madvise_mm(p->mm);
> + up_write(&p->mm->mmap_sem);
> +out:
> + read_unlock(&tasklist_lock);
> + if (err)
> + break;
> + cond_resched();
> + }
> + return err;
> +}
> +
> +/**
> + * ksm_do_scan - the ksm scanner main worker function.
> + * @scan_npages - number of pages we want to scan before we return.
> + */
> +static void ksm_do_scan(unsigned int scan_npages)
> +{
> + struct rmap_item *rmap_item;
> + struct page *uninitialized_var(page);

gcc is silly. I think that got fixed in more recent versions.

> + while (scan_npages-- && likely(!freezing(current))) {
> + cond_resched();
> + rmap_item = scan_get_next_rmap_item(&page);
> + if (!rmap_item)
> + return;
> + if (!PageKsm(page) || !in_stable_tree(rmap_item))
> + cmp_and_merge_page(page, rmap_item);
> + put_page(page);
> + }
> +}
> +
> +static int ksm_scan_thread(void *nothing)
> +{
> + set_freezable();
> + set_user_nice(current, 5);

The reason for the set_user_nice() is a total mystery to this and any
other reader. Hence it needs a comment.

> + while (!kthread_should_stop()) {
> + mutex_lock(&ksm_thread_mutex);
> + if (ksmd_should_madvise()) {
> + ksm_madvise_all();
> + ksm_share_all = 0;
> + }
> + if (ksmd_should_run())
> + ksm_do_scan(ksm_thread_pages_to_scan);
> + mutex_unlock(&ksm_thread_mutex);
> +
> + try_to_freeze();
> +
> + if (ksmd_should_run()) {
> + schedule_timeout_interruptible(
> + msecs_to_jiffies(ksm_thread_sleep_millisecs));
> + } else {
> + wait_event_freezable(ksm_thread_wait,
> + ksmd_should_run() ||
> + ksmd_should_madvise() ||
> + kthread_should_stop());
> + }
> + }
> + return 0;
> +}
> +
> struct page *ksm_does_need_to_copy(struct page *page,
> struct vm_area_struct *vma, unsigned long address)
> {
>
> ...
>
> +static ssize_t merge_nodes_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long knob;
> +
> + err = kstrtoul(buf, 10, &knob);
> + if (err)
> + return err;
> + if (knob > 1)
> + return -EINVAL;
> +
> + mutex_lock(&ksm_thread_mutex);
> + if (ksm_run & KSM_RUN_MERGE) {
> + err = -EBUSY;
> + } else {
> + if (ksm_merge_nodes != knob) {
> + if (ksm_pages_shared > 0)
> + err = -EBUSY;

What's happening here? The attempt to set merge_nodes can randomly
fail due to internal transient state within ksm? That sounds rather
user-hostile.

What did the user do wrong and how should he correct the situation?

What documentation should he have read to avoid this mistake?

> + else
> + ksm_merge_nodes = knob;
> + }
> + }
> +
> + if (err)
> + count = err;
> + mutex_unlock(&ksm_thread_mutex);
> +
> + return count;
> +}
> +KSM_ATTR(merge_nodes);
> +#endif
> +
> +static ssize_t share_all_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", ksm_share_all);
> +}
> +
> +static ssize_t share_all_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long knob;
> +
> + err = kstrtoul(buf, 10, &knob);
> + if (err)
> + return err;
> + if (knob > 1)
> + return -EINVAL;
> +
> + mutex_lock(&ksm_thread_mutex);
> + if (ksm_run & KSM_RUN_MERGE) {
> + err = -EBUSY;

OK, this one makes more sense: the user most stop KSM before altering
share_all. Document this?

> + } else {
> + if (ksm_share_all != knob)
> + ksm_share_all = knob;
> + }
> + if (err)
> + count = err;
> + mutex_unlock(&ksm_thread_mutex);
> +
> + return count;
> +}
> +KSM_ATTR(share_all);
>
> ...
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Dhaval Giani: "Re: [RFC] cgroup TODOs"
Previous message: Mark Langsdorf: "Re: linux-next: Tree for Sept 14 (sata_highhbank)"
In reply to: Petr Holasek: "[PATCH v3] KSM: numa awareness sysfs knob"
Next in thread: Petr Holasek: "Re: [PATCH v3] KSM: numa awareness sysfs knob"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]