Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs

From: Baolin Wang
Date: Thu Jul 17 2025 - 04:01:35 EST

Next message: Antonio Quartulli: "Re: [PATCH] pNFS: fix uninitialized pointer access"
Previous message: John Garry: "Re: [PATCH v7 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size"
Next in thread: David Hildenbrand: "Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2025/7/16 04:03, Hugh Dickins wrote:

On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:

On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:

On 2025/7/7 21:33, Lorenzo Stoakes wrote:

On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:

On 2025/7/5 06:18, Andrew Morton wrote:

On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs can also support large folio allocation (not just PMD-sized large
folios).

However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
we still establish mappings at the base page granularity, which is unreasonable.

We can map multiple consecutive pages of a tmpfs folios at once according to
the size of the large folio. On one hand, this can reduce the overhead of page
faults; on the other hand, it can leverage hardware architecture optimizations
to reduce TLB misses, such as contiguous PTEs on the ARM architecture.

Moreover, tmpfs mount will use the 'huge=' option to control large folio
allocation explicitly. So it can be understood that the process's RSS statistics
might increase, and I think this will not cause any obvious effects for users.

Performance test:
I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
sequentially via mmap(). I observed a significant performance improvement:

That doesn't sound like a crazy thing to do.

Before the patch:
real 0m0.158s
user 0m0.008s
sys 0m0.150s

After the patch:
real 0m0.021s
user 0m0.004s
sys 0m0.017s

And look at that.

diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b..9944380e947d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
/*
* Using per-page fault to maintain the uffd semantics, and same
- * approach also applies to non-anonymous-shmem faults to avoid
+ * approach also applies to non shmem/tmpfs faults to avoid
* inflating the RSS of the process.
*/
- if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
+ if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
unlikely(needs_fallback)) {
nr_pages = 1;
} else if (nr_pages > 1) {

and that's it?

I'm itching to get this into -stable, really. What LTS user wouldn't
want this?

This is an improvement rather than a bugfix, so I don't think it needs to go
into LTS.

Could it be viewed as correcting an oversight in

acd7ccb284b8?

Yes, I should have added this optimization in the series of the commit
acd7ccb284b8. But obviously, I missed this :(.

Buuut if this was an oversight for that patch that causes an unnecessary
perf degradation, surely this should have fixes tag + cc stable no?

IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
used to introduce a new feature, while the current patch is a further
reasonable optimization. As I mentioned, this is an improvement, not a
bugfix or a patch to address performance regression.

4Well :) you say yourself it was an oversight, and it very clearly has a perf
_impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
get your point.

However, since you say 'oversight' this seems to me that you really meant to
have included it but hadn't noticed, and additionally, since it just seems to be
an unequivical good - let's maybe flip this round - why NOT backport it to
stable?

I strongly agree with Baolin: this patch is good, thank you, but it is
a performance improvement, a new feature, not a candidate for the stable
tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
stable tag before advancing the patch from mm-unstable to mm-stable.

And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.

Agree.

An unequivocal good? I'm not so sure.

I expect it ought to be limited, by fault_around_bytes (or suchlike).

If I understand all the mTHP versus large folio versus PMD-huge handling
correctly (and of course I do not, I'm still weeks if not months away
from understanding most of it), the old vma_is_anon_shmem() case would
be limited by the shmem mTHP tunables, and one can reasonably argue that
they would already take fault_around_bytes-like considerations into account;
but the newly added file-written cases are governed by huge= mount options
intended for PMD-size, but (currently) permitting all lesser orders.
I don't think that mounting a tmpfs huge=always implies that mapping
256 PTEs for one fault is necessarily a good strategy.

But looking in the opposite direction, why is there now a vma_is_shmem()
check there in finish_fault() at all? If major filesystems are using
large folios, why aren't they also allowed to benefit from mapping
multiple PTEs at once (in this shared-writable case which the existing
fault-around does not cover - I presume to avoid write amplification,
but that's not an issue when the folio is large already).

This is what I'm going to do next. For other filesystems, I think they should also map multiple PTEs at once. IIUC, I don't think this will lead to write amplification (because large folios are already allocated, it will just cause RSS to inflate, but I wonder if this actually causes any problem?), instead the current dirty tracking for mmap write access is per-folio (see iomap_page_mkwrite()), which can cause write amplification.

It's fine to advance cautiously, keeping this to shmem in coming release;
but I think it should be extended soon (without any backport to stable),
and consideration given to limiting it.

Yes, I will consider that. But fault_around_bytes is tricky, we can discuss it in subsequent patches. Thanks for your comments.

Next message: Antonio Quartulli: "Re: [PATCH] pNFS: fix uninitialized pointer access"
Previous message: John Garry: "Re: [PATCH v7 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size"
Next in thread: David Hildenbrand: "Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]