Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order

From: Zi Yan
Date: Tue Jun 03 2025 - 11:14:55 EST


On 3 Jun 2025, at 10:55, Zi Yan wrote:

> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>
>> On 21.05.25 23:57, Juan Yescas wrote:
>>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>>> and this causes the CMA reservations to be larger than necessary.
>>> This means that the system will have fewer available MIGRATE_UNMOVABLE and
>>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fall back to them.
>>>
>>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>>
>>> For example, in ARM, the CMA alignment requirement when:
>>>
>>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>>
>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>>> -----------------------------------------------------------------------
>>> 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
>>> 16KiB | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
>>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>>
>>> There are some extreme cases for the CMA alignment requirement when:
>>>
>>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set
>>> - CONFIG_HUGETLB_PAGE is NOT set:
>>>
>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>>> ------------------------------------------------------------------------
>>> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
>>> 16KiB | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
>>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>>
>>> This affects the CMA reservations for the drivers. If a driver in a
>>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>>> reservation has to be 32MiB due to the alignment requirements:
>>>
>>> reserved-memory {
>>> ...
>>> cma_test_reserve: cma_test_reserve {
>>> compatible = "shared-dma-pool";
>>> size = <0x0 0x400000>; /* 4 MiB */
>>> ...
>>> };
>>> };
>>>
>>> reserved-memory {
>>> ...
>>> cma_test_reserve: cma_test_reserve {
>>> compatible = "shared-dma-pool";
>>> size = <0x0 0x2000000>; /* 32 MiB */
>>> ...
>>> };
>>> };
>>>
>>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>>> allows setting the page block order on all architectures.
>>> The maximum page block order will be given by
>>> ARCH_FORCE_MAX_ORDER.
>>>
>>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>>> value as ARCH_FORCE_MAX_ORDER. This will make sure that
>>> current kernel configurations won't be affected by this
>>> change. It is an opt-in change.
>>>
>>> This patch makes it possible to have the same CMA alignment
>>> requirements for large page sizes (16KiB, 64KiB) as in
>>> 4KiB kernels by setting a lower pageblock_order.
>>>
>>> Tests:
>>>
>>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>>> on 4k and 16k kernels.
>>>
>>> - Verified that Transparent Huge Pages work when pageblock_order
>>> is 1, 7, 10 on 4k and 16k kernels.
>>>
>>> - Verified that dma-buf heaps allocations work when pageblock_order
>>> is 1, 7, 10 on 4k and 16k kernels.
>>>
>>> Benchmarks:
>>>
>>> The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
>>> pageblock_order 7 was chosen because it makes the min CMA alignment
>>> requirement the same as in 4KiB kernels (2MiB).
>>>
>>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>>> the # of instructions and page-faults on 16k kernels.
>>> The benchmark was executed 10 times. The averages are below:
>>>
>>>          # instructions          |   # page-faults
>>>    order 10    |    order 7     | order 10 | order 7
>>> -------------------------------------------------------
>>> 13,891,765,770 | 11,425,777,314 |   220    |   217
>>> 14,456,293,487 | 12,660,819,302 |   224    |   219
>>> 13,924,261,018 | 13,243,970,736 |   217    |   221
>>> 13,910,886,504 | 13,845,519,630 |   217    |   221
>>> 14,388,071,190 | 13,498,583,098 |   223    |   224
>>> 13,656,442,167 | 12,915,831,681 |   216    |   218
>>> 13,300,268,343 | 12,930,484,776 |   222    |   218
>>> 13,625,470,223 | 14,234,092,777 |   219    |   218
>>> 13,508,964,965 | 13,432,689,094 |   225    |   219
>>> 13,368,950,667 | 13,683,587,37  |   219    |   225
>>> -------------------------------------------------------
>>> 13,803,137,433 | 13,131,974,268 |   220    |   220    Averages
>>>
>>> There were 4.86% fewer instructions when the order was 7, in
>>> comparison with order 10.
>>>
>>> 13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
>>>
>>> The average number of page faults was the same for orders 7 and 10.
>>>
>>> These results didn't show any significant regression when the
>>> pageblock_order is set to 7 on 16kb kernels.
>>>
>>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>>> on the 16k kernels with pageblock_order 7 and 10.
>>>
>>> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
>>> -------------------------------------------------------------------
>>> 15.8 | 16.4 | 0.6 | 3.80%
>>> 16.4 | 16.2 | -0.2 | -1.22%
>>> 16.6 | 16.3 | -0.3 | -1.81%
>>> 16.8 | 16.3 | -0.5 | -2.98%
>>> 16.6 | 16.8 | 0.2 | 1.20%
>>> -------------------------------------------------------------------
>>> 16.44 | 16.4 | -0.04 | -0.24% Averages
>>>
>>> The results didn't show any significant regression when the
>>> pageblock_order is set to 7 on 16kb kernels.
>>>
>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>>> Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
>>> CC: Mike Rapoport <rppt@xxxxxxxxxx>
>>> Cc: Zi Yan <ziy@xxxxxxxxxx>
>>> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
>>> Cc: Minchan Kim <minchan@xxxxxxxxxx>
>>> Signed-off-by: Juan Yescas <jyescas@xxxxxxxxxx>
>>> Acked-by: Zi Yan <ziy@xxxxxxxxxx>
>>> ---
>>> Changes in v7:
>>> - Update alignment calculation to 2MiB as per David's
>>> observation.
>>> - Update page block order calculation in mm/mm_init.c for
>>> powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
>>>
>>> Changes in v6:
>>> - Applied the change provided by Zi Yan to fix
>>> the Kconfig. The change consists in evaluating
>>> to true or false in the if expression for range:
>>> range 1 <symbol> if <expression to eval true/false>.
>>>
>>> Changes in v5:
>>> - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
>>> ranges with config definitions don't work in Kconfig,
>>> for example (range 1 MY_CONFIG).
>>> - Add PAGE_BLOCK_ORDER_MANUAL config for the
>>> page block order number. The default value was not
>>> defined.
>>> - Fix typos reported by Andrew.
>>> - Test default configs in powerpc.
>>>
>>> Changes in v4:
>>> - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
>>> validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
>>> compile time.
>>> - This change fixes the warning in:
>>> https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@xxxxxxxxx/
>>>
>>> Changes in v3:
>>> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>>> as per Matthew's suggestion.
>>> - Update comments in pageblock-flags.h for pageblock_order
>>> value when THP or HugeTLB are not used.
>>>
>>> Changes in v2:
>>> - Add Zi's Acked-by tag.
>>> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>>> per Zi's and Matthew's suggestion so it is available to
>>> all the architectures.
>>> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>>> ARCH_FORCE_MAX_ORDER is not available.
>>>
>>> include/linux/mmzone.h | 16 ++++++++++++++++
>>> include/linux/pageblock-flags.h | 8 ++++----
>>> mm/Kconfig | 34 +++++++++++++++++++++++++++++++++
>>> mm/mm_init.c | 2 +-
>>> 4 files changed, 55 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 6ccec1bf2896..05610337bbb6 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -37,6 +37,22 @@
>>> #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
>>> +/* Defines the order for the number of pages that have a migrate type. */
>>> +#ifndef CONFIG_PAGE_BLOCK_ORDER
>>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>>> +#else
>>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>>> +
>>> +/*
>>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
>>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
>>> + * which defines the order for the number of pages that can have a migrate type
>>> + */
>>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
>>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
>>> +#endif
>>> +
>>> /*
>>> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>>> * costly to service. That is between allocation orders which should
>>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>>> index fc6b9c87cb0a..e73a4292ef02 100644
>>> --- a/include/linux/pageblock-flags.h
>>> +++ b/include/linux/pageblock-flags.h
>>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>>> * Huge pages are a constant size, but don't exceed the maximum allocation
>>> * granularity.
>>> */
>>> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>>> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>>> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>>> -#define pageblock_order MAX_PAGE_ORDER
>>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>>> +#define pageblock_order PAGE_BLOCK_ORDER
>>> #endif /* CONFIG_HUGETLB_PAGE */
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index e113f713b493..13a5c4f6e6b6 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -989,6 +989,40 @@ config CMA_AREAS
>>> If unsure, leave the default value "8" in UMA and "20" in NUMA.
>>> +#
>>> +# Select this config option from the architecture Kconfig, if available, to set
>>> +# the max page order for physically contiguous allocations.
>>> +#
>>> +config ARCH_FORCE_MAX_ORDER
>>> + int
>>> +
>>> +#
>>> +# When ARCH_FORCE_MAX_ORDER is not defined,
>>> +# the default page block order is MAX_PAGE_ORDER (10) as per
>>> +# include/linux/mmzone.h.
>>> +#
>>> +config PAGE_BLOCK_ORDER
>>> + int "Page Block Order"
>>> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0
>>> + default 10 if ARCH_FORCE_MAX_ORDER = 0
>>> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>> + help
>>> + The page block order refers to the power of two number of pages that
>>> + are physically contiguous and can have a migrate type associated to
>>> + them. The maximum size of the page block order is limited by
>>> + ARCH_FORCE_MAX_ORDER.
>>> +
>>> + This config allows overriding the default page block order when the
>>> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
>>> + or MAX_PAGE_ORDER.
>>> +
>>> + Reducing pageblock order can negatively impact THP generation
>>> + success rate. If your workloads uses THP heavily, please use this
>>> + option with caution.
>>> +
>>> + Don't change if unsure.
>>
>>
>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>
>> CONFIG_PAGE_BLOCK_ORDER=10
>>
>>
>> But then, we'll do this
>>
>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>
>>
>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>
>> Confusing.
>>
>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>
> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.

LIMIT might still be ambiguous, since it can be a lower limit or an upper
limit. CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I came up
with; if it looks good to you, I can send it out properly.

From 7fff4fd87ed3aa160db8d2f0d9e5b219321df4f9 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@xxxxxxxxxx>
Date: Tue, 3 Jun 2025 11:09:37 -0400
Subject: [PATCH] mm: rename CONFIG_PAGE_BLOCK_ORDER to
CONFIG_PAGE_BLOCK_ORDER_CEIL.

The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.

Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
---
include/linux/mmzone.h | 14 +++++++-------
include/linux/pageblock-flags.h | 8 ++++----
mm/Kconfig | 15 ++++++++-------
3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..523b407e63e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,19 +38,19 @@
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)

/* Defines the order for the number of pages that have a migrate type. */
-#ifndef CONFIG_PAGE_BLOCK_ORDER
-#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#ifndef CONFIG_PAGE_BLOCK_ORDER_CEIL
+#define PAGE_BLOCK_ORDER_CEIL MAX_PAGE_ORDER
#else
-#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
-#endif /* CONFIG_PAGE_BLOCK_ORDER */
+#define PAGE_BLOCK_ORDER_CEIL CONFIG_PAGE_BLOCK_ORDER_CEIL
+#endif /* CONFIG_PAGE_BLOCK_ORDER_CEIL */

/*
* The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
- * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
+ * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER_CEIL,
* which defines the order for the number of pages that can have a migrate type
*/
-#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
-#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#if (PAGE_BLOCK_ORDER_CEIL > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER_CEIL
#endif

/*
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e73a4292ef02..e7a86cd238c2 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
* Huge pages are a constant size, but don't exceed the maximum allocation
* granularity.
*/
-#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER_CEIL)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)

-#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL)

#else /* CONFIG_TRANSPARENT_HUGEPAGE */

-/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
-#define pageblock_order PAGE_BLOCK_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER_CEIL */
+#define pageblock_order PAGE_BLOCK_ORDER_CEIL

#endif /* CONFIG_HUGETLB_PAGE */

diff --git a/mm/Kconfig b/mm/Kconfig
index eccb2e46ffcb..3b27e644bd1f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1017,8 +1017,8 @@ config ARCH_FORCE_MAX_ORDER
# the default page block order is MAX_PAGE_ORDER (10) as per
# include/linux/mmzone.h.
#
-config PAGE_BLOCK_ORDER
- int "Page Block Order"
+config PAGE_BLOCK_ORDER_CEIL
+ int "Page Block Order Upper Limit"
range 1 10 if ARCH_FORCE_MAX_ORDER = 0
default 10 if ARCH_FORCE_MAX_ORDER = 0
range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
@@ -1026,12 +1026,13 @@ config PAGE_BLOCK_ORDER
help
The page block order refers to the power of two number of pages that
are physically contiguous and can have a migrate type associated to
- them. The maximum size of the page block order is limited by
- ARCH_FORCE_MAX_ORDER.
+ them. The maximum size of the page block order is at least limited by
+ ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER.

- This config allows overriding the default page block order when the
- page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
- or MAX_PAGE_ORDER.
+ This config adds a new upper limit of default page block
+ order when the page block order is required to be smaller than
+ ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER or other limits
+ (see include/linux/pageblock-flags.h for details).

Reducing pageblock order can negatively impact THP generation
success rate. If your workloads uses THP heavily, please use this
--
2.47.2
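
For reference, here is a minimal user-space sketch of why the config acts as
a ceiling rather than as the pageblock order itself, and how the resulting
order feeds into the CMA alignment from the tables quoted above. The helper
names are hypothetical; the kernel computes these values with the macros in
include/linux/pageblock-flags.h, not with functions like these. The huge page
orders passed in are assumptions taken from the discussion (9 for x86-64 with
4KiB pages, 11 for the 16KiB row of the first table).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the effective pageblock order is the huge page
 * order capped by the configured ceiling, as in
 * MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL).
 */
static unsigned int effective_pageblock_order(unsigned int huge_page_order,
					      unsigned int order_ceil,
					      unsigned int max_page_order)
{
	/* Runtime stand-in for the compile-time #error check in mmzone.h. */
	assert(order_ceil <= max_page_order);
	return huge_page_order < order_ceil ? huge_page_order : order_ceil;
}

/*
 * Hypothetical helper: minimum CMA alignment in bytes, computed as
 * PAGE_SIZE * 2^pageblock_order like the tables in the commit message.
 */
static uint64_t cma_min_alignment_bytes(uint64_t page_size,
					unsigned int pageblock_order)
{
	return page_size << pageblock_order;
}
```

With a ceiling of 10 and a huge page order of 9, effective_pageblock_order(9,
10, 10) gives 9, which is the mismatch David pointed out. With 16KiB pages and
a ceiling of 7, cma_min_alignment_bytes(16384, 7) gives 2MiB, matching the
4KiB case.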



Best Regards,
Yan, Zi