Re: [PATCH 2/2] mm: eliminate node 0 bias in MPOL_INTERLEAVE

From: Ray Bryant
Date: Fri Sep 24 2004 - 01:50:06 EST


Andi Kleen wrote:
> On Wed, Sep 22, 2004 at 11:32:45PM -0500, Ray Bryant wrote:
>
> > Each of these cases potentially breaks the (assumed) invariant of
>
> I would prefer to keep the invariant.


I understand, but read on...


> > +++ linux-2.6.9-rc2-mm1/mm/mempolicy.c	2004-09-21 17:44:58.000000000 -0700
> > @@ -435,7 +435,7 @@ asmlinkage long sys_set_mempolicy(int re
> >  		default_policy[policy] = new;
> >  	}
> >  	if (new && new->policy == MPOL_INTERLEAVE)
> > -		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
> > +		current->il_next = current->pid % MAX_NUMNODES;


> Please do the find_next/find_first bit here in the slow path.
>
> Another useful change may be to check if il_next points to a node
> that is in the current interleaving mask. If yes, don't change it.
> That way, skew when the interleaving policy is set often could be
> avoided.
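
If I follow, the slow-path version of that would look something like
this in sys_set_mempolicy() (a sketch only, untested; it folds the
pid-based seed into the new mask up front so the fast path needs no
fixup, and leaves il_next alone when it is already in the mask):

	if (new && new->policy == MPOL_INTERLEAVE &&
	    !test_bit(current->il_next, new->v.nodes)) {
		/* cursor not in the new mask: advance to the next
		 * member, wrapping to the first one off the end */
		unsigned nid = find_next_bit(new->v.nodes, MAX_NUMNODES,
					     current->il_next + 1);
		if (nid >= MAX_NUMNODES)
			nid = find_first_bit(new->v.nodes, MAX_NUMNODES);
		current->il_next = nid;
	}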


> >  	return 0;
> >  }
> >
> > @@ -714,6 +714,11 @@ static unsigned interleave_nodes(struct
> >
> >  	nid = me->il_next;
> >  	BUG_ON(nid >= MAX_NUMNODES);
> > +	if (!test_bit(nid, policy->v.nodes)) {
> > +		nid = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
> > +		if (nid >= MAX_NUMNODES)
> > +			nid = find_first_bit(policy->v.nodes, MAX_NUMNODES);
> > +	}


> And remove it here.


Regardless of whether we remove this or not, we have a potential problem,
I think. The reason is that there is a single il_next shared by all
policies, so we get into trouble if the current process's page allocation
policy and its page cache allocation policy are both MPOL_INTERLEAVE but
the node masks for the two policies are significantly different. To be
specific, suppose there are 64 nodes, the page allocation policy selects
nodes 0-53, and the page cache allocation policy selects nodes 54-63.
Further suppose that allocation requests alternate: page, page cache,
page, page cache, etc.

Then if il_next starts out at zero, here are the nodes that will be
selected (assuming the code I inserted above is not present):

  request a page:      get node 0;  the page allocation mask sets il_next to 1
  request page cache:  get node 1;  the page cache mask sets il_next to 54
  request a page:      get node 54; the page allocation mask sets il_next to 0
  request page cache:  get node 0;  the page cache mask sets il_next to 54
  request a page:      get node 54; the page allocation mask sets il_next to 0
  etc...

This is not good: once the pattern settles, all of the pages are allocated
from the first page cache node, and all of the page cache pages are
allocated from the first page allocation node.
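
To make that concrete, here is a little userspace mock-up of the
shared-cursor behavior (illustration only; next_node() is a stand-in
for interleave_nodes(), without the test_bit() fixup from my patch):

	#include <stdio.h>

	#define MAX_NUMNODES 64

	static int il_next;	/* single shared cursor, as in task_struct */

	/* return the current cursor, then advance it to the next node
	 * that is set in mask[], wrapping around */
	static int next_node(const int *mask)
	{
		int nid = il_next;
		int i;

		for (i = 1; i <= MAX_NUMNODES; i++) {
			int n = (nid + i) % MAX_NUMNODES;
			if (mask[n]) {
				il_next = n;
				break;
			}
		}
		return nid;
	}

	int main(void)
	{
		int page_mask[MAX_NUMNODES] = { 0 };	/* nodes 0-53 */
		int cache_mask[MAX_NUMNODES] = { 0 };	/* nodes 54-63 */
		int i;

		for (i = 0; i < 54; i++)
			page_mask[i] = 1;
		for (i = 54; i < MAX_NUMNODES; i++)
			cache_mask[i] = 1;

		for (i = 0; i < 4; i++)
			printf("page: %2d   page cache: %2d\n",
			       next_node(page_mask), next_node(cache_mask));
		return 0;
	}

It prints "page:  0   page cache:  1" on the first pass and
"page: 54   page cache:  0" on every pass after that, which is exactly
the trace above.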

I guess I am back to passing an offset in via the page cache allocation
call (see the sketch below). Or we would need a second il_next for the
page cache policy, and that is more cruft than we are willing to live
with, I expect.
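
For what it's worth, the passing-it-in variant would look something
like this (hypothetical signature; today interleave_nodes() takes only
the policy and always uses current->il_next):

	/* hypothetical: the caller chooses the cursor, so the page
	 * policy and the page cache policy each keep their own position */
	static unsigned interleave_nodes(struct mempolicy *policy, short *cursor)
	{
		unsigned nid = *cursor;
		unsigned next;

		next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1 + nid);
		if (next >= MAX_NUMNODES)
			next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
		*cursor = next;
		return nid;
	}

with callers passing &current->il_next, or a new (and equally
hypothetical) &current->il_next_cache for the page cache policy, which
is the extra cruft I mentioned.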

I'll look at Steve's patch and see how he handles this.


> >  	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
> >  	if (next >= MAX_NUMNODES)
> >  		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
> > Index: linux-2.6.9-rc2-mm1/kernel/fork.c
> > ===================================================================
> > --- linux-2.6.9-rc2-mm1.orig/kernel/fork.c	2004-09-21 16:24:49.000000000 -0700
> > +++ linux-2.6.9-rc2-mm1/kernel/fork.c	2004-09-21 17:41:12.000000000 -0700
> > @@ -873,6 +873,8 @@ static task_t *copy_process(unsigned lon
> >  			goto bad_fork_cleanup;
> >  		}
> >  	}
> > +	/* randomize placement of first page across nodes */
> > +	p->il_next = p->pid % MAX_NUMNODES;


> Same here.
>
> -Andi
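
For the fork path, I read "same here" as: keep the pid-based seed, but
fold it into the inherited policy's mask right away, so that
interleave_nodes() never sees a cursor outside the mask. Something
like this, untested:

	/* randomize placement of first page across nodes, then make
	 * sure the cursor actually lands in the interleave mask */
	p->il_next = p->pid % MAX_NUMNODES;
	if (p->mempolicy && p->mempolicy->policy == MPOL_INTERLEAVE &&
	    !test_bit(p->il_next, p->mempolicy->v.nodes)) {
		p->il_next = find_next_bit(p->mempolicy->v.nodes,
					   MAX_NUMNODES, p->il_next + 1);
		if (p->il_next >= MAX_NUMNODES)
			p->il_next = find_first_bit(p->mempolicy->v.nodes,
						    MAX_NUMNODES);
	}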



--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@xxxxxxx raybry@xxxxxxxxxxxxx
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
