RE: Memory policy question for NUMA arch....

From: Lee Schermerhorn
Date: Mon Apr 19 2010 - 11:16:22 EST


On Fri, 2010-04-16 at 16:17 -0700, Chetan Loke wrote:
> Hello,
>
> PS - Please 'CC' me on the emails. I have not subscribed to the list.
>
> > Hi Andy,
> >
> > --- On Wed, 4/7/10, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> > > On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > > > On a NUMA host, if a driver calls __get_free_pages() then
> > > > it will eventually invoke ->alloc_pages_current(..). The comment
> > > > above/within alloc_pages_current() says 'current->mempolicy' will be
> > > > used. So what memory policy will kick-in if the driver is trying to
> > > > allocate some memory blocks during driver load
> > > > time (say from probe_one)? System-wide default policy, correct?
> > >
> > > Actually the policy of the modprobe or the kernel boot up if built in
> > > (which is interleaving)
> > >
>
> I may be wrong but I think there's a difference. The system-wide run-time
> default policy is MPOL_PREFERRED | MPOL_F_LOCAL and not interleaving.
>
> So, if current->mempolicy is set then default_policy will not be used.
> And now if you don't want the default_policy mode then what?
> I'm stuck in this confused state too. So we have two cases to take care of -
>
> Case1) current->mempolicy is initialized and so we can just set it to
> whatever we like and then reset it once we are done with
> __get_free_pages(..) etc.

Yes, as Andi mentioned. Also, see my response to Rick at:

http://marc.info/?l=linux-kernel&m=127066130315241&w=4


>
> Case2) current->mempolicy is not initialized. Then default_policy is
> used. Now if we have to muck with the default_policy then we will need
> to lock it down. Otherwise some other consumer will get affected by
> it.

If current->mempolicy is not initialized, you can create a new one and
set it temporarily. You could probably call do_set_mempolicy() directly
the way numa_policy_init() does and then call numa_default_policy() to
restore it to default.

You should never change the system default once the system is up and
running.

>
> But both the above solutions are twisted. Why not just create a
> different wrapper? This way we can leave both current & default_policy
> alone.
>
> #ifdef CONFIG_NUMA
> __get_free_policy_pages(policy,mask,order)??
> #endif

As Andi mentioned in his response, you could certainly do this as long
as it doesn't impact the normal allocation path.
>
> For now I may end up hacking my kernel and implementing the above
> mentioned quick and dirty solution. But if there's a cleaner approach
> then please let me know.
>
> PS - We should create some wrappers that will automatically figure
> out the MSIX-affinity (if present/set) and then default the allocation
> to that node?

I'm still not clear on what your requirements are, but if existing
interfaces don't suffice, such a wrapper might make sense.
__get_free_pages() is simply a wrapper around alloc_pages() that then
returns page_address() of the resulting page. So, something like
'get_free_pages_node()'--which should probably live in
mm/page_alloc.c--would just be a wrapper around alloc_pages_node() that
then returns the page_address() of the page.
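
A minimal sketch of what such a wrapper might look like follows. The name
get_free_pages_node() is hypothetical; no such interface exists in the
tree, and this is untested. It assumes the caller does not pass
__GFP_HIGHMEM, since page_address() is only meaningful for pages that
have a kernel mapping.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, not an existing kernel interface: a node-aware
 * counterpart to __get_free_pages(). The caller must not pass
 * __GFP_HIGHMEM.
 */
static unsigned long get_free_pages_node(int nid, gfp_t gfp_mask,
					 unsigned int order)
{
	struct page *page;

	/* Allocate 2^order contiguous pages, preferring node 'nid'. */
	page = alloc_pages_node(nid, gfp_mask, order);
	if (!page)
		return 0;

	/* Return the kernel virtual address, as __get_free_pages() does. */
	return (unsigned long)page_address(page);
}
```

Memory obtained this way would be released with free_pages(addr, order),
the same as memory from __get_free_pages().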

A device-centric interface--e.g., 'get_free_pages_dev()'--could get the
device/bus node affinity via dev_to_node() and then do the
allocation/conversion. I think this is close to what you're suggesting
above. See dma_generic_alloc_coherent() [in arch/x86/kernel/pci-dma.c]
for an example of a wrapper that does the device affinity lookup and
allocation in one function.
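
As a rough illustration only, a device-centric variant could look like
the sketch below. The name get_free_pages_dev() is hypothetical and the
code is untested. It relies on dev_to_node() returning -1 when the
device has no recorded node, in which case alloc_pages_node() falls back
to the node of the CPU doing the allocation. As with __get_free_pages(),
the caller must not pass __GFP_HIGHMEM.

```c
#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, not an existing kernel interface: allocate
 * pages on the node the device is attached to.
 */
static unsigned long get_free_pages_dev(struct device *dev, gfp_t gfp_mask,
					unsigned int order)
{
	struct page *page;

	/* A negative node id makes alloc_pages_node() use the local node. */
	page = alloc_pages_node(dev_to_node(dev), gfp_mask, order);
	if (!page)
		return 0;

	/* Convert the page to its kernel virtual address. */
	return (unsigned long)page_address(page);
}
```

From a PCI probe routine this might be called as
get_free_pages_dev(&pdev->dev, GFP_KERNEL, order), with the result
released later via free_pages().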

Of course, you could just do this in your driver, as well.

> Also, is there a way to configure irqbalance and ask it to leave these
> guys alone? Like a config file that says - leave these
> irqs/pci-devices alone. For now I've shut down irqbalance.

If you're using a version of irqbalance that supports it, you can set the
environment variable IRQBALANCE_BANNED_INTERRUPTS when starting
irqbalance to the list of interrupts that it should ignore. Check the
init script that starts irqbalance on your distro of choice.

Regards,
Lee
