On Wed, 25 Jun 2025, Baolin Wang wrote:
When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
callers who do not specify this flag, it creates an odd and surprising situation
where a sysadmin specifying 'never' for all THP sizes still observes THP pages
being allocated and used on the system. MADV_COLLAPSE is an example of
such a case: it does not set TVA_ENFORCE_SYSFS when calling
thp_vma_allowable_orders().
As we discussed in the previous thread [1], MADV_COLLAPSE ignores
the system-wide anon/shmem THP sysfs settings, which means that even though
we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into an anon/shmem THP. This violates the rule we have
agreed upon: never means never.
For example, system administrators who have disabled THP everywhere clearly do
not want THP to be used, for whatever reason. Individual programs being able to
quietly override this is very surprising and likely to cause headaches for those
who do not want it to happen on their systems.
This patch set will address the MADV_COLLAPSE issue.
Test
====
1. Tested the mm selftests and found no regressions.
2. While toggling different anon mTHP settings, allocation and madvise collapse for
anonymous pages work well.
3. While toggling different shmem mTHP settings, allocation and madvise collapse for
shmem work well.
4. Tested large-order allocation for tmpfs, which works as expected.
[1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@xxxxxxxxxxxxxxxxx/
Changes from v3:
- Collect reviewed tags. Thanks.
- Update the commit message, per David.
Changes from v2:
- Update the commit message and cover letter, per Lorenzo. Thanks.
- Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
Changes from v1:
- Update the commit message, per Zi.
- Add Zi's reviewed tag. Thanks.
- Update the shmem logic.
Baolin Wang (2):
mm: huge_memory: disallow hugepages if the system-wide THP sysfs
settings are disabled
mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
settings are disabled
include/linux/huge_mm.h | 51 ++++++++++++++++++-------
mm/shmem.c | 6 +--
tools/testing/selftests/mm/khugepaged.c | 8 +---
3 files changed, 43 insertions(+), 22 deletions(-)
--
2.43.5
Sorry for chiming in so late, after so much effort: but I beg you,
please drop these.
I did not want to get into a fight, and had been hoping a voice of
reason would come from others, before I got around to responding.
And indeed Ryan understood correctly at the start; and he, Usama
and Barry, perhaps others I've missed, have raised appropriate
concerns but not prevailed.
If we're sloganeering, I much prefer "never break userspace" to
"never means never", attractive though that over-simplification is.
Seldom has a feature been so thoroughly documented as MADV_COLLAPSE,
in its 6.1 commits and in the "man 2 madvise" page: which are
explicit about MADV_COLLAPSE providing a way to get THPs where the
sysfs setting governing automatic behaviour does not insert them.
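
For reference, a minimal userspace sketch of the documented usage: map an anonymous region, populate it, and ask the kernel to collapse it synchronously. The helper name collapse_one_region() and the REGION_SIZE constant are hypothetical, and the 2 MiB size assumes an x86-64 style PMD; the fallback value 25 for MADV_COLLAPSE is the asm-generic uapi value, supplied only for older libc headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* asm-generic uapi value, for libc headers that predate 6.1 */
#endif

/* Assumption: 2 MiB PMD-sized THP, as on x86-64. */
#define REGION_SIZE (2UL << 20)

/*
 * Hypothetical helper: map, populate and try to collapse one aligned
 * region. Returns 0 on success, or -errno from mmap()/madvise().
 */
int collapse_one_region(void)
{
    size_t map_len = 2 * REGION_SIZE; /* over-map so an aligned window fits */
    char *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -errno;

    /* Round up to the next REGION_SIZE boundary inside the mapping. */
    char *aligned = (char *)(((uintptr_t)map + REGION_SIZE - 1) &
                             ~(uintptr_t)(REGION_SIZE - 1));

    /* Fault the pages in so there is something to collapse. */
    memset(aligned, 0, REGION_SIZE);

    /* Capture errno before munmap() can overwrite it. */
    int ret = madvise(aligned, REGION_SIZE, MADV_COLLAPSE) == 0 ? 0 : -errno;

    munmap(map, map_len);
    return ret;
}
```

A caller would treat a negative return as "no THP here" and carry on; -EINVAL is what to expect on kernels without MADV_COLLAPSE.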
We would all prefer a less messy world of THP tunables. I certainly
find plenty to dislike there too; and wish that a less assertive name
than "never" had been chosen originally for the default off position.
But please don't break the accepted and documented behaviour of
MADV_COLLAPSE now.
If you want to exclude all possibility of THPs, then please use the
prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
who insisted that be respected by MADV_COLLAPSE back then).
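
A minimal sketch of the per-process opt-out, for anyone who has not used it: the two helper names below are hypothetical wrappers, while PR_SET_THP_DISABLE and PR_GET_THP_DISABLE are the prctl(2) options themselves. The setting is inherited across fork() and preserved across execve(), so a small launcher can apply it to a whole process tree.

```c
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: set the "THP disable" flag for the calling
 * process. Returns 0 on success, -1 with errno set on failure.
 * The unused prctl() arguments must be zero.
 */
int disable_thp_for_this_process(void)
{
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}

/*
 * Hypothetical wrapper: query the flag. Returns 1 if THP is disabled
 * for the calling process, 0 if not, -1 with errno set on failure.
 */
int thp_disabled_for_this_process(void)
{
    return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

Calling disable_thp_for_this_process() early in main(), before the interesting mappings are created, is the usual pattern.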
Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
if you like. (But in these days of filesystem large folios, adding
new protections against them seems a few years late.)
If Andrew decides that these patches should go in, then I'll have to
scrutinize them more carefully than I've done so far: but currently
I'm hoping to avoid that.
Hugh