Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes

From: David Hildenbrand
Date: Fri Jun 06 2025 - 13:37:55 EST

Next message: Helge Deller: "Re: [PATCH] nvidiafb: fix build on 32-bit ARCH=um"
Previous message: Max Kellermann: "Re: [PATCH] fs/ceph/io: make ceph_start_io_*() killable"
Next in thread: Lorenzo Stoakes: "Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 06.06.25 16:37, Usama Arif wrote:

On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
watermarks are evaluated to extremely high values, for e.g. a server with
480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
of the sizes set to never, the min, low and high watermarks evaluate to
11.2G, 14G and 16.8G respectively.
In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
and 1G respectively.
This is because set_recommended_min_free_kbytes is designed for PMD
hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
Such high watermark values can cause performance and latency issues in
memory bound applications on arm servers that use 64K PAGE_SIZE, eventhough
most of them would never actually use a 512M PMD THP.

Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
folio order enabled in set_recommended_min_free_kbytes.
With this patch, when only 2M THP hugepage size is set to madvise for the
same machine with 64K page size, with the rest of the sizes set to never,
the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
respectively. When 512M THP hugepage size is set to madvise for the same
machine with 64K page size, the min, low and high watermarks evaluate to
11.2G, 14G and 16.8G respectively, the same as without this patch.

An alternative solution would be to change PAGE_BLOCK_ORDER by changing
ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
is not dynamic with hugepage size, will need different kernel builds for
different hugepage sizes and most users won't know that this needs to be
done as it can be difficult to detmermine that the performance and latency
issues are coming from the high watermark values.

All watermark numbers are for zones of nodes that had the highest number
of pages, i.e. the value for min size for 4K is obtained using:
cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
and for 64K using:
cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';

An arbirtary min of 128 pages is used for when no hugepage sizes are set
enabled.

Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
---
include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
mm/khugepaged.c | 32 ++++++++++++++++++++++++++++----
mm/shmem.c | 29 +++++------------------------
3 files changed, 58 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..fb4e51ef0acb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
}
#endif
+/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ * disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ * enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ * only allocate huge pages if the page will be fully within i_size,
+ * also respect madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ * only allocate huge pages if requested with madvise();
+ */
+
+ #define SHMEM_HUGE_NEVER 0
+ #define SHMEM_HUGE_ALWAYS 1
+ #define SHMEM_HUGE_WITHIN_SIZE 2
+ #define SHMEM_HUGE_ADVISE 3
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern unsigned long transparent_hugepage_flags;
@@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
+extern int shmem_huge __read_mostly;
+extern unsigned long huge_shmem_orders_always;
+extern unsigned long huge_shmem_orders_madvise;
+extern unsigned long huge_shmem_orders_inherit;
+extern unsigned long huge_shmem_orders_within_size;

Do really all of these have to be exported?

+
static inline bool hugepage_global_enabled(void)
{
return transparent_hugepage_flags &
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15203ea7d007..e64cba74eb2a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
return 0;
}
+static int thp_highest_allowable_order(void)

Did you mean "largest" ?

+{
+ unsigned long orders = READ_ONCE(huge_anon_orders_always)
+ | READ_ONCE(huge_anon_orders_madvise)
+ | READ_ONCE(huge_shmem_orders_always)
+ | READ_ONCE(huge_shmem_orders_madvise)
+ | READ_ONCE(huge_shmem_orders_within_size);
+ if (hugepage_global_enabled())
+ orders |= READ_ONCE(huge_anon_orders_inherit);
+ if (shmem_huge != SHMEM_HUGE_NEVER)
+ orders |= READ_ONCE(huge_shmem_orders_inherit);
+
+ return orders == 0 ? 0 : fls(orders) - 1;
+}

But how does this interact with large folios / THPs in the page cache?

+
+static unsigned long min_thp_pageblock_nr_pages(void)

Reading the function name, I have no idea what this function is supposed to do.

--
Cheers,

David / dhildenb

Next message: Helge Deller: "Re: [PATCH] nvidiafb: fix build on 32-bit ARCH=um"
Previous message: Max Kellermann: "Re: [PATCH] fs/ceph/io: make ceph_start_io_*() killable"
Next in thread: Lorenzo Stoakes: "Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]