Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options

From: David Hildenbrand
Date: Wed Jul 30 2025 - 05:31:50 EST


On 30.07.25 10:14, Baolin Wang wrote:
After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
we have extended tmpfs to allow large folios of any size, rather than just
PMD-sized large folios.

The strategy discussed previously was:

"
Considering that tmpfs already has the 'huge=' option to control the
PMD-sized large folios allocation, we can extend the 'huge=' option to
allow any sized large folios. The semantics of the 'huge=' mount option
are:

huge=never: no large folios of any size
huge=always: large folios of any size
huge=within_size: like 'always', but respect the i_size
huge=advise: like 'always' if requested with madvise()

Note: for tmpfs mmap() faults, since there is no write size hint, PMD-sized
huge folios are still allocated if huge=always/within_size/advise is set.

Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' retain their existing
semantics: 'deny' disables large folios of any size for tmpfs, while
'force' enables PMD-sized large folios for tmpfs.
"

This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
tmpfs derives a highest-order hint from the sizes passed on the write() and
fallocate() paths, and then tries each allowable large order, rather than
always attempting to allocate PMD-sized large folios as before.
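As a rough illustration of deriving that hint, consider this standalone C
sketch (not the kernel implementation; constants assume 4K pages, and
highest_order_hint() is a made-up name):

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT      12
#define PMD_ORDER       9       /* 2M PMD with 4K pages on x86-64 */

static unsigned int highest_order_hint(size_t write_len)
{
        size_t pages = write_len >> PAGE_SHIFT;
        unsigned int order = 0;

        /* Largest power-of-two number of pages covered by this write. */
        while (pages > 1) {
                pages >>= 1;
                order++;
        }
        return order > PMD_ORDER ? PMD_ORDER : order;
}

int main(void)
{
        /* A 256K write yields order-6; a 4M write is capped at order-9. */
        printf("256K -> order %u\n", highest_order_hint(256UL << 10));
        printf("4M   -> order %u\n", highest_order_hint(4UL << 20));
        return 0;
}

The allocator then tries that order first and falls back through the
remaining allowable orders on failure.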

However, this might break use cases that depend on PMD-sized large folios,
such as the i915 driver, which does not supply a write size hint when
allocating shmem [1].

Moreover, Hugh complained that this would cause a regression in userspace
with 'huge=always' or 'huge=within_size'.

So, let's revisit the strategy for tmpfs large folio allocation. A simple fix
would be to always try PMD-sized large folios first and, if that fails, fall
back to smaller large folios. However, this approach differs from the large
folio allocation strategy used by other file systems. Is this acceptable?
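For illustration, a minimal sketch of that simple fix, where try_alloc_folio()
and its failure pattern are stand-ins, not the real allocator:

#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER       9

/* Stand-in allocator: pretend fragmentation makes orders above 4 fail. */
static bool try_alloc_folio(int order)
{
        return order <= 4;
}

/* Try PMD-sized first, then step down through smaller large orders. */
static int alloc_largest_available(void)
{
        for (int order = PMD_ORDER; order >= 0; order--) {
                if (try_alloc_folio(order))
                        return order;
        }
        return -1;      /* even order-0 failed */
}

int main(void)
{
        printf("allocated order %d\n", alloc_largest_available());
        return 0;
}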

My opinion so far has been that anon and shmem are different from ordinary FSes ... primarily because allocation (readahead) + reclaim (writeback) behave differently.

There were opinions in the past that tmpfs should just behave like any other fs, and I think that's what we tried to satisfy here: use the write size as an indication.

I assume there will be workloads where either approach is beneficial. I also assume that some workloads on ordinary FSes could benefit from the same strategy (start with PMD), while others clearly will not.

So no real opinion; none of it feels ideal ... at least with this approach here we would stick more closely to the old tmpfs behavior.

--
Cheers,

David / dhildenb