Re: [PATCH] Fix fake numa on ppc

From: David Rientjes
Date: Wed Sep 02 2009 - 01:58:56 EST

On Wed, 2 Sep 2009, Ankita Garg wrote:

> > > With the patch,
> > >
> > > # cat /proc/cmdline
> > > root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > > # cat /sys/devices/system/node/node0/cpulist
> > > 0-3
> > > # cat /sys/devices/system/node/node1/cpulist
> > >
> >
> > Oh! interesting.. cpuless nodes :) I think we need to fix this in the
> > longer run and distribute cpus between fake numa nodes of a real node
> > using some acceptable heuristic.
> >
> True. Presently this is broken on both x86 and ppc systems. It would be
> interesting to find a way to map, for example, 4 cpus to >4 number of
> fake nodes created from a single real numa node!

We've done it for years on x86_64. It's quite trivial to map all fake
nodes within a physical node to the cpus to which they have affinity, both
via node_to_cpumask_map() and cpu_to_node_map(). There should be no
kernel space dependencies on a cpu appearing in only a single node's
cpumask. And if you map each fake node to its physical node's pxm, you can
index into the SLIT and generate local NUMA distances amongst fake nodes.

So if you map the apicids and pxms appropriately depending on the
physical topology of the machine, that is the only emulation necessary on
x86_64 for the page allocator zonelist ordering, task migration, etc. (If
you use CONFIG_SLAB, you'll need to avoid the exponential growth of alien
caches, but that's an implementation detail and isn't really within the
scope of numa=fake's purpose to modify.)
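
As a rough illustration of the mapping described above, here is a minimal
user-space sketch in C. It is not the kernel's implementation. The table
names (slit, fake_to_phys, cpu_to_phys), the helper names, and the sizes
and distance values are all hypothetical and chosen only for the example.
It assumes two physical nodes with four cpus each, and two fake nodes
carved out of each physical node.

```c
#include <assert.h>

#define NR_PHYS_NODES 2   /* assumed: two real NUMA nodes */
#define NR_FAKE_NODES 4   /* assumed: two fake nodes per real node */
#define NR_CPUS       8   /* assumed: four cpus per real node */

/* Hypothetical SLIT-like table: distance between physical nodes. */
static const int slit[NR_PHYS_NODES][NR_PHYS_NODES] = {
	{ 10, 20 },
	{ 20, 10 },
};

/* Hypothetical: physical node each fake node was carved out of. */
static const int fake_to_phys[NR_FAKE_NODES] = { 0, 0, 1, 1 };

/* Hypothetical: physical node each cpu belongs to. */
static const int cpu_to_phys[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/*
 * Distance between two fake nodes, looked up through their physical
 * nodes. Fake nodes sharing a physical node get the local distance.
 */
static int fake_node_distance(int a, int b)
{
	assert(a >= 0 && a < NR_FAKE_NODES);
	assert(b >= 0 && b < NR_FAKE_NODES);
	return slit[fake_to_phys[a]][fake_to_phys[b]];
}

/*
 * Bitmask of cpus with affinity to a fake node: every cpu of the
 * underlying physical node. A cpu may therefore appear in the masks
 * of several fake nodes.
 */
static unsigned long fake_node_cpumask(int fake)
{
	unsigned long mask = 0;
	int cpu;

	assert(fake >= 0 && fake < NR_FAKE_NODES);
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_to_phys[cpu] == fake_to_phys[fake])
			mask |= 1UL << cpu;
	return mask;
}
```

With these assumed tables, fake nodes 0 and 1 both report cpus 0-3 and
are at local distance from each other, while fake nodes 0 and 2 are at
the remote distance of their physical nodes. No fake node ends up
cpuless.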