Re: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

From: Vlastimil Babka
Date: Tue Feb 21 2023 - 05:27:21 EST


On 2/17/23 13:05, Alexander Halbuer wrote:
> As proposed by Vlastimil Babka [1], this is an extension to the previous patch
> "mm: reduce lock contention of pcp buffer refill". This version also moves the
> is_migrate_cma() check from the critical section to the loop below.

Hi, thanks for trying it out!

> The optimization has several advantages:
> - Less time in critical section
> - Batching update of NR_FREE_CMA_PAGES into a single call to
> mod_zone_page_state()
> - Utilizing cache locality of the related struct page when doing the cma check
> is_migrate_cma() and the sanity check check_pcp_refill() in the same loop

The last point probably doesn't apply, as we are already reading/modifying
page->lru with __rmqueue()/list_add_tail(), so that brings the struct page
into the cache anyway.
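
Both loops end up touching the same struct page cache line either way; as a
rough illustration (field names recalled from the current struct page layout,
so take the details with a grain of salt):

	/* first loop, under zone->lock */
	struct page *page = __rmqueue(zone, order, migratetype, alloc_flags);
	list_add_tail(&page->pcp_list, list);	/* writes page->pcp_list */

	/* second loop, after unlock */
	get_pcppage_migratetype(pos);		/* reads page->index */
	check_pcp_refill(pos, order);		/* reads flags, mapping, _refcount */

and all of those fields sit within the first 64 bytes of struct page.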

>
> However, this change only affects performance with CONFIG_CMA=y, which
> may not be the common case. Another possibly negative effect is that the
> actual update of NR_FREE_CMA_PAGES is delayed until after the zone lock
> has been released, resulting in a short window during which the counter
> and the actual number of free cma pages in the zone disagree.

Hmm, I didn't realize that; it might indeed be a problem.
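
AFAICS the main reader that would see the window is the watermark check,
which subtracts the free cma pages for allocations that can't use CMA;
quoting __zone_watermark_unusable_free() from memory, so the exact shape
may be slightly off:

#ifdef CONFIG_CMA
	/* If allocation can't use CMA areas don't use free CMA pages */
	if (!(alloc_flags & ALLOC_CMA))
		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

Since NR_FREE_PAGES is already decremented under the lock while
NR_FREE_CMA_PAGES is not yet, a watermark check racing with the refill sees
fewer usable pages than there really are, so at least the error should be
in the conservative direction.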

> The tables below compare this patch with the initial one using a
> parallel allocation benchmark. The kernel config used is based on the
> default Debian config, but with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=n and
> CONFIG_CMA=y. The benchmarks have been performed with the default sanity
> checks enabled, as the patch "mm, page_alloc: reduce page alloc/free
> sanity checks" [2] was not applied to my test branch.
> The given times are average allocation times. The improvement is not
> significant, but the general trend is visible.

Yeah, there's some improvement, but if [2] is accepted, then keeping two
loops there just for the cma update (as there will be no more checking in
the second loop) will almost certainly stop being a win, and that's on top
of the risk of inaccuracy you pointed out.
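
If it's mainly the batched NR_FREE_CMA_PAGES update we want to keep, that
could also stay in the single loop once [2] removes the checking there,
something like this (untested sketch, just to illustrate):

	spin_lock_irqsave(&zone->lock, flags);
	for (i = 0; i < count; ++i) {
		struct page *page =
			__rmqueue(zone, order, migratetype, alloc_flags);
		if (unlikely(page == NULL))
			break;

		/* count cma pages, but update the counter only once below */
		if (is_migrate_cma(get_pcppage_migratetype(page)))
			cma_pages++;

		list_add_tail(&page->pcp_list, list);
		allocated++;
	}
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
	if (cma_pages)
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
				      -(cma_pages << order));
	spin_unlock_irqrestore(&zone->lock, flags);

That keeps a single update of the counter per refill, avoids the second
loop, and adjusts NR_FREE_CMA_PAGES before the zone lock is dropped, so the
inaccuracy window goes away too.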

Incidentally, did you observe any improvement from [2] with your test,
especially as the batch freeing side also no longer does the checking under
the zone lock?

Thanks!

> Hopefully, without the sanity checks and with cma disabled, the compiler
> will be able to optimize away the second loop entirely.
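
With CONFIG_CMA=n that should mostly come for free, as is_migrate_cma() is
a compile-time constant in that configuration; from include/linux/mmzone.h
(quoting from memory):

#ifdef CONFIG_CMA
#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
#  define is_migrate_cma(migratetype) false
#endif

So the cma branch folds away for sure; whether the then empty list walk
disappears completely is up to the compiler.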
>
> [1] https://lore.kernel.org/lkml/1d468148-936f-8816-eb71-1662f2d4945b@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20230216095131.17336-1-vbabka@xxxxxxx/
>
> Normal Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> |       |  (ns)   |  (ns)   |  Diff   |
> +-------+---------+---------+---------+
> |     1 |     132 |     129 | (-2.3%) |
> |     2 |     147 |     145 | (-1.4%) |
> |     3 |     148 |     147 | (-0.7%) |
> |     4 |     175 |     173 | (-1.1%) |
> |     6 |     263 |     255 | (-3.0%) |
> |     8 |     347 |     337 | (-2.9%) |
> |    10 |     432 |     421 | (-2.5%) |
> |    12 |     516 |     505 | (-2.1%) |
> |    14 |     604 |     590 | (-2.3%) |
> |    16 |     695 |     680 | (-2.2%) |
> |    20 |     869 |     844 | (-2.9%) |
> |    24 |    1043 |    1015 | (-2.7%) |
> +-------+---------+---------+---------+
>
> Huge Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> |       |  (ns)   |  (ns)   |  Diff   |
> +-------+---------+---------+---------+
> |     1 |    3177 |    3133 | (-1.4%) |
> |     2 |    3486 |    3471 | (-0.4%) |
> |     3 |    3644 |    3634 | (-0.3%) |
> |     4 |    3669 |    3643 | (-0.7%) |
> |     6 |    3587 |    3578 | (-0.3%) |
> |     8 |    3635 |    3621 | (-0.4%) |
> |    10 |    4015 |    3960 | (-1.4%) |
> |    12 |    4681 |    4510 | (-3.7%) |
> |    14 |    5398 |    5180 | (-4.0%) |
> |    16 |    6239 |    5891 | (-5.6%) |
> |    20 |    7864 |    7435 | (-5.5%) |
> |    24 |    9011 |    8971 | (-0.4%) |
> +-------+---------+---------+---------+
>
> Reported-by: Vlastimil Babka <vbabka@xxxxxxx>
> Signed-off-by: Alexander Halbuer <halbuer@xxxxxxxxxxxxxxxxxxx>
> ---
> mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 36 insertions(+), 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..f82a59eeb4fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3119,17 +3119,17 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  {
>  	unsigned long flags;
>  	int i, allocated = 0;
> +	int cma_pages = 0;
> +	struct list_head *prev_tail = list->prev;
> +	struct page *pos, *n;
>  
>  	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
> -		struct page *page = __rmqueue(zone, order, migratetype,
> -							alloc_flags);
> +		struct page *page =
> +			__rmqueue(zone, order, migratetype, alloc_flags);
>  		if (unlikely(page == NULL))
>  			break;
>  
> -		if (unlikely(check_pcp_refill(page, order)))
> -			continue;
> -
>  		/*
>  		 * Split buddy pages returned by expand() are received here in
>  		 * physical page order. The page is added to the tail of
> @@ -3141,20 +3141,44 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -					      -(1 << order));
>  	}
>  
>  	/*
> -	 * i pages were removed from the buddy list even if some leak due
> -	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
> -	 * on i. Do not confuse with 'allocated' which is the number of
> +	 * i pages were removed from the buddy list so adjust NR_FREE_PAGES
> +	 * based on i. Do not confuse with 'allocated' which is the number of
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> +
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		/*
> +		 * Count number of cma pages to batch update of
> +		 * NR_FREE_CMA_PAGES into a single function call.
> +		 */
> +		if (is_migrate_cma(get_pcppage_migratetype(pos)))
> +			cma_pages++;
> +
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +
> +	if (cma_pages > 0) {
> +		mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> +				    -(cma_pages << order));
> +	}
> +
>  	return allocated;
>  }
> 