Re: [PATCH] page_alloc: skip cpuset enforcement for lower zone allocations

From: Marcelo Tosatti
Date: Tue May 27 2014 - 10:54:46 EST


On Tue, May 27, 2014 at 09:21:32AM -0500, Christoph Lameter wrote:
> On Fri, 23 May 2014, Marcelo Tosatti wrote:
>
> > Zone-specific allocations, such as GFP_DMA32, should not be restricted
> > to a cpuset's allowed node list: the zones which such allocations demand
> > might be contained in particular nodes outside the cpuset node list.
> >
> > The alternative would be to not perform such allocations from
> > applications which are cpuset restricted, which is unrealistic.
> >
> > Fixes KVM's alloc_page(gfp_mask=GFP_DMA32) with cpuset as explained.
>
> Memory policies are only applied to a specific zone so this is not
> unprecedented. However, if a user wants to limit allocation to a specific
> node and there is no DMA memory there, then maybe that is an operator
> error? After all, the application will be using memory from a node that the
> operator explicitly wanted not to be used.

OK, here is the use case:

- The machine contains a driver which requires zone-specific memory (such
as KVM, which requires its root page table at paddr < 4GB).

- The user wants to limit the application's allocations to nodeX, and
nodeX has no memory < 4GB.

How would you solve that? Options:

1) Force the admin to allow allocation from the node(s) which contain the
0-4GB range. Unfortunately, that would allow every allocation, including
ones which are not restricted to particular nodes, to be performed
there.

or

2) Allow zone-specific allocations to bypass memory policies.

It seems 2) is the best option (and there is precedent for it).
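
To make option 2 concrete, here is a minimal userspace sketch, not the
actual patch. All names in it (zone_kind, alloc_req, candidate_nodes) are
hypothetical, and node sets are modelled as plain bitmasks. It only
illustrates the idea: a request limited to a lower zone is not filtered by
the cpuset node list, while any other request is.

```c
#include <stdbool.h>

/* Hypothetical model; none of these names are kernel APIs. */
enum zone_kind { ZK_DMA, ZK_DMA32, ZK_NORMAL };

struct alloc_req {
	enum zone_kind highest_zone;	/* highest zone the caller accepts */
	unsigned long mems_allowed;	/* nodes in the task's cpuset (bitmask) */
};

/*
 * Return the bitmask of nodes the allocation may be served from.
 * all_nodes is the bitmask of every node with memory in the system.
 */
static unsigned long candidate_nodes(const struct alloc_req *req,
				     unsigned long all_nodes)
{
	/* Option 2: zone-specific requests skip cpuset enforcement. */
	if (req->highest_zone < ZK_NORMAL)
		return all_nodes;

	/* Everything else stays confined to the cpuset node list. */
	return all_nodes & req->mems_allowed;
}
```

With node 0 holding the 0-4GB range and the task confined to node 1, a
ZK_DMA32 request yields both nodes as candidates, while a ZK_NORMAL request
stays on node 1.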

> There is also the hardwall flag. I think it's OK to allocate outside of the
> cpuset if that flag is not set. However, if it is set, then any attempt to
> alloc outside of the cpuset should fail.

GFP_ATOMIC already bypasses hardwall, as the comment in the cpuset code
explains:

* The second pass through get_page_from_freelist() doesn't even call
* here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
* variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
* in alloc_flags. That logic and the checks below have the combined
* affect that:
*     in_interrupt - any node ok (current task context irrelevant)
*     GFP_ATOMIC   - any node ok
*     TIF_MEMDIE   - any node ok
*     GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
*     GFP_USER     - only nodes in current tasks mems allowed ok.
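
The table in that comment can be read as a small decision function. The
following is a userspace sketch under stated assumptions, not the kernel
code: the enum, the struct and node_allowed() are hypothetical names, and
the three node sets are assumed to be precomputed bitmasks.

```c
#include <stdbool.h>

/* Hypothetical labels for the five rows of the table. */
enum alloc_ctx {
	CTX_IN_INTERRUPT,
	CTX_GFP_ATOMIC,
	CTX_TIF_MEMDIE,
	CTX_GFP_KERNEL,
	CTX_GFP_USER,
};

struct node_sets {
	unsigned long all_nodes;	/* every node in the system */
	unsigned long hardwall_mems;	/* nodes of the enclosing hardwalled cpuset */
	unsigned long current_mems;	/* nodes in the current task's mems_allowed */
};

/* Return true if an allocation in context ctx may use the given node. */
static bool node_allowed(enum alloc_ctx ctx, int node,
			 const struct node_sets *s)
{
	unsigned long bit = 1UL << node;

	switch (ctx) {
	case CTX_IN_INTERRUPT:
	case CTX_GFP_ATOMIC:
	case CTX_TIF_MEMDIE:
		return (s->all_nodes & bit) != 0;	/* any node ok */
	case CTX_GFP_KERNEL:
		return (s->hardwall_mems & bit) != 0;
	case CTX_GFP_USER:
		return (s->current_mems & bit) != 0;
	}
	return false;
}
```

The point for this thread is the first group of cases: allocations which
cannot wait are already served from any node, hardwall or not.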



