Re: [Lse-tech] Re: [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND

From: Paul Jackson
Date: Tue Aug 03 2004 - 03:04:25 EST


Earlier, I (pj) wrote:
> It has poor cache performance on big iron. For a modest job on a big
> system, the allocator has to walk down an average of 128 out of 256 zone
> pointers in the list, dereferencing each one into the zone struct, then
> into the struct pglist_data, before it finds one that matches an allowed
> node id. That's a nasty memory footprint for a hot code path.

This paragraph is B.S. Most tasks are running on CPUs that are on nodes
whose memory they are allowed to use. That node is at the front of the
local zonelist, and they get their memory on the first node they look.

Damn ... hate it when that happens ;).

Still, either MPOL_BIND needs a more numa-friendly set of zonelists,
having a differently sorted list for each node in the set, or its
usefulness for binding to more than one or a few very close nodes, if
you care about memory performance, falls off quickly as the number of
nodes increases. As you well know, any such numa-friendly set of sorted
zonelists will require space on the Order of N**2, for N the node count,
given the NULL-terminated linear list form in which they must be handed
to __alloc_pages.
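
To put a rough number on that N**2 growth, here is a back-of-the-envelope
sketch. It assumes one NULL-terminated list per node, each list naming
every zone on every node. The function names, the zones-per-node count
and the pointer size are illustrative assumptions of mine, not values
taken from the kernel.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of pointer slots in one NULL-terminated
 * zonelist that names every zone on n_nodes nodes (plus the NULL).
 */
static size_t zonelist_entries(size_t n_nodes, size_t zones_per_node)
{
	return n_nodes * zones_per_node + 1;
}

/*
 * Hypothetical helper: bytes needed for one differently sorted zonelist
 * per node. Grows as n_nodes squared.
 */
static size_t sorted_zonelists_bytes(size_t n_nodes, size_t zones_per_node,
				     size_t ptr_size)
{
	return n_nodes * zonelist_entries(n_nodes, zones_per_node) * ptr_size;
}
```

With 256 nodes, one zone per node and 8-byte pointers this comes to
256 * 257 * 8 = 526336 bytes, against 160 bytes for 4 nodes.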

I suspect that the English phrase you are searching for now to tell me
is "if it hurts, don't use it ;)." That is, you are clearly advising me
not to use MPOL_BIND if I need a fancy zonelist sort.

The place I ran into the most complexity doing this in the 2.4 kernel
was in the per-memory region binding. You're dealing with this in the
2.6 kernels, and when you get to stuff like shared memory and huge
pages, it's not easy. At least the vma splitting code is better in
2.6 than it was in 2.4. Whatever I do for cpusets must _not_ duplicate
your virtual address range specific work (mbind). Too much detail to be
done twice.

Andi wrote:
> My first reaction that if you really want to do that, just pass
> the policy node bitmap to alloc_pages and try_to_free_pages
> and use the normal per node zone list with the bitmap as filter.

Pass in, or add to task_struct? I can imagine adding a:

	nodemask_t mems_allowed;

to task_struct, and ending up with a CONFIG_CPUSET enabled macro called
in a few places in __alloc_pages() and try_to_free_pages() that amounts
to:

	if (!in_interrupt())
		if (!node_isset(z->zone_pgdat->node_id, current->mems_allowed))
			continue;
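
For illustration only, a userspace model of how such a filter might sit
inside the zonelist walk. Everything prefixed toy_ is a stand-in I made
up for this sketch (the real zone, nodemask_t and allocator loop look
different), and the in_irq argument stands in for the in_interrupt()
test.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a zone; not the kernel's struct zone. */
struct toy_zone {
	int node_id;	/* node this zone belongs to */
	int free_pages;	/* pretend free-page count */
};

/* Toy nodemask: one bit per node, so at most 64 nodes. */
typedef uint64_t toy_nodemask_t;

static int toy_node_isset(int node, toy_nodemask_t mask)
{
	return (int)((mask >> node) & 1u);
}

/*
 * Walk a NULL-terminated zonelist in order and return the first zone
 * that sits on an allowed node and has a free page. When in_irq is
 * nonzero the mems_allowed filter is skipped.
 */
static struct toy_zone *toy_pick_zone(struct toy_zone **zonelist,
				      toy_nodemask_t mems_allowed,
				      int in_irq)
{
	for (struct toy_zone **zp = zonelist; *zp != NULL; zp++) {
		struct toy_zone *z = *zp;

		if (!in_irq && !toy_node_isset(z->node_id, mems_allowed))
			continue;	/* node not allowed for this task */
		if (z->free_pages > 0)
			return z;
	}
	return NULL;	/* nothing usable on the allowed nodes */
}
```

The point of the model is that the ordinary per-node zonelist stays as
it is, and the bitmap only decides which entries get skipped.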

In any event, cpusets provides the larger "container" on bigger numa
systems, and mbind/mempolicy provides the more detailed, and vma
specific, placement within the container (or within the entire system
if cpusets are not configured).

I'll try coding this up and see how it looks.

I welcome your further comments.

Thank-you.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373