Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

From: Chuck Lever III
Date: Sun Mar 14 2021 - 11:23:43 EST




> On Mar 14, 2021, at 8:52 AM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Sat, Mar 13, 2021 at 07:33:43PM +0000, Matthew Wilcox wrote:
>> On Sat, Mar 13, 2021 at 04:56:31PM +0000, Chuck Lever III wrote:
>>> IME lists are indeed less CPU-efficient, but I wonder if that
>>> expense is insignificant compared to serialization primitives like
>>> disabling and re-enabling IRQs, which we are avoiding by using
>>> bulk page allocation.
>>
>> Cache misses are a worse problem than serialisation. Paul McKenney had
>> a neat demonstration where he took a sheet of toilet paper to represent
>> an instruction, and then unrolled two rolls of toilet paper around the
>> lecture theatre to represent an L3 cache miss. Obviously a serialising
>> instruction is worse than an add instruction, but i'm thinking maybe
>> 50-100 sheets of paper, not an entire roll?
>>
>
> I'm well array of the advantages of arrays over lists. The reality is that
> the penalty is incurred unconditionally as the pages have to be removed
> from the per-cpu or buddy lists and the cache footprint of the allocator
> and the data copies are already large. It's also the case that bulk free
> interfaces already exist that operate on lists (free_unref_page_list)
> so there is existing precedent. The bulk free API in this series was not
> used by the callers so I've deleted it.
>
> Obviously the callers would need to be adjusted to use the array
> interface. The sunrpc user has an array but it is coded in a way that
> expects the array could be partially populated or has holes so the API has
> to skip populated elements. The caller is responsible for making sure that
> there are enough NULL elements available to store nr_pages or the buffer
> overruns. nr_elements could be passed in to avoid the buffer overrun but
> then further logic is needed to distinguish between a failed allocation
> and a failure to have enough space in the array to store the pointer.
> It also means that prep_new_page() should not be deferred outside of
> the IRQ disabled section as it does not have the storage to track which
> pages were freshly allocated and which ones were already on the array. It
> could be tracked using the lower bit of the pointer but that is not free
> either. Ideally the callers simply would ensure the array does not have
> valid struct page pointers in it already so prepping the new page could
> always be deferred. Obviously the callers are also responsible for
> ensuring protecting the array from parallel access if necessary while
> calling into the allocator.
>
>> Anyway, I'm not arguing against a bulk allocator, nor even saying this
>> is a bad interface. It just maybe could be better.
>>
>
> I think it puts more responsibility on the caller to use the API correctly
> but I also see no value in arguing about it further because there is no
> supporting data either way (I don't have routine access to a sufficiently
> fast network to generate the data). I can add the following patch and let
> callers figure out which interface is preferred. If one of the interfaces
> is dead in a year, it can be removed.
>
> As there are a couple of ways the arrays could be used, I'm leaving it
> up to Jesper and Chuck which interface they want to use. In particular,
> it would be preferred if the array has no valid struct pages in it but
> it's up to them to judge how practical that is.

I'm interested to hear from Jesper.

My two cents (US):

If svc_alloc_arg() is the /only/ consumer that wants to fill
a partially populated array of page pointers, then there's no
code-duplication benefit to changing the synopsis of
alloc_pages_bulk() at this point.

Also, if the consumers still have to pass in the number of
pages the array needs, rather than having the bulk allocator
figure it out, then there's not much additional benefit, IMO.

Ideally (for SUNRPC) alloc_pages_bulk() would take a pointer
to a sparsely-populated array and the total number of elements
in that array, and fill in the NULL elements. The return value
would be "success -- all elements are populated" or "failure --
some elements remain NULL".

But again, if no other consumer finds that useful, or that API
design obscures the performance benefits of the bulk allocator,
I can be very happy with the list-centric API. My interest in
this part of the exercise is simply to reduce the overall amount
of new complexity across mm/ and consumers of the bulk allocator.


> Patch is only lightly tested with a poor conversion of the sunrpc code
> to use the array interface.
>
> ---8<---
> mm/page_alloc: Add an array-based interface to the bulk page allocator
>
> The existing callers for the bulk allocator are storing the pages in
> arrays. This patch adds an array-based interface to the API to avoid
> multiple list iterations. The page list interface is preserved to
> avoid requiring all users of the bulk API to allocate and manage
> enough storage to store the pages.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4a304fd39916..fb6234e1fe59 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -520,13 +520,20 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
>
> int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> nodemask_t *nodemask, int nr_pages,
> - struct list_head *list);
> + struct list_head *page_list,
> + struct page **page_array);
>
> /* Bulk allocate order-0 pages */
> static inline unsigned long
> -alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> +alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> {
> - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
> +}
> +
> +static inline unsigned long
> +alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
> +{
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
> }
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e0c87c588d3..96590f0726c7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4965,13 +4965,20 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
>
> /*
> * This is a batched version of the page allocator that attempts to
> - * allocate nr_pages quickly from the preferred zone and add them to list.
> + * allocate nr_pages quickly from the preferred zone. Pages are added
> + * to page_list if page_list is not NULL, otherwise it is assumed
> + * that the page_array is valid.
> + *
> + * If using page_array, only NULL elements are populated with pages.
> + * The caller must ensure that the array has enough NULL elements
> + * to store nr_pages or the buffer overruns.
> *
> * Returns the number of pages allocated.
> */
> int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> nodemask_t *nodemask, int nr_pages,
> - struct list_head *alloc_list)
> + struct list_head *page_list,
> + struct page **page_array)
> {
> struct page *page;
> unsigned long flags;
> @@ -4987,6 +4994,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> if (WARN_ON_ONCE(nr_pages <= 0))
> return 0;
>
> + if (WARN_ON_ONCE(!page_list && !page_array))
> + return 0;
> +
> if (nr_pages == 1)
> goto failed;
>
> @@ -5035,7 +5045,24 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> break;
> }
>
> - list_add(&page->lru, alloc_list);
> + if (page_list) {
> + /* New page prep is deferred */
> + list_add(&page->lru, page_list);
> + } else {
> + /* Skip populated elements */
> + while (*page_array)
> + page_array++;
> +
> + /*
> + * Array pages must be prepped immediately to
> + * avoid tracking which pages are new and
> + * which ones were already on the array.
> + */
> + prep_new_page(page, 0, gfp, 0);
> + *page_array = page;
> + page_array++;
> + }
> +
> allocated++;
> }
>
> @@ -5044,9 +5071,12 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>
> local_irq_restore(flags);
>
> - /* Prep page with IRQs enabled to reduce disabled times */
> - list_for_each_entry(page, alloc_list, lru)
> - prep_new_page(page, 0, gfp, 0);
> + /* Prep pages with IRQs enabled if using a list */
> + if (page_list) {
> + list_for_each_entry(page, page_list, lru) {
> + prep_new_page(page, 0, gfp, 0);
> + }
> + }
>
> return allocated;
>
> @@ -5056,7 +5086,10 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> failed:
> page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
> if (page) {
> - list_add(&page->lru, alloc_list);
> + if (page_list)
> + list_add(&page->lru, page_list);
> + else
> + *page_array = page;
> allocated = 1;
> }
>

--
Chuck Lever