Re: [PATCH v2 8/8] ext4: enable large folio for regular file
From: D, Suneeth
Date: Thu Jun 26 2025 - 07:30:13 EST
Hello Zhang Yi,
On 5/12/2025 12:03 PM, Zhang Yi wrote:
From: Zhang Yi <yi.zhang@xxxxxxxxxx>
Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
large folios for regular files. Enable this feature by default. However,
since we cannot change the folio order limitation of mappings on active
inodes, setting the journal=data mode via ioctl on an active inode will
not take immediate effect in non-delalloc mode.
We run lmbench3 in our weekly CI to check for kernel performance
regressions between stable and rc kernels. We noticed a regression of
8-12% on kernels from 6.16-rc1 through 6.16-rc3. Bisecting between 6.15
and 6.16-rc1 pointed to 7ac67301e82f02b77a5c8e7377a1f414ef108b84 as the
first bad commit. The machine configuration and test parameters were as
follows:
Model name: AMD EPYC 9754 128-Core Processor [Bergamo]
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 1
Total online memory: 258G
micro-benchmark_variant: "lmbench3-development-1-0-MMAP-50%", with the
following parameters:
-> nr_thread: 1
-> memory_size: 50%
-> mode: development
-> test: MMAP
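For anyone unfamiliar with this kind of test, the access pattern it exercises can be sketched roughly as below. This is not lmbench's code: mmap_read_sum() and its stride parameter are hypothetical names used only for illustration. The sketch maps a file read-only and touches one byte every stride bytes, so the first touch of each page goes through the file-backed page fault path.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of lmbench): map `path` read-only and
 * read one byte every `stride` bytes.
 * Returns 0 and stores the sum of the bytes read in *sum, or -1 on error.
 */
int mmap_read_sum(const char *path, size_t stride, uint64_t *sum)
{
	struct stat st;
	uint64_t acc = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (stride == 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}

	size_t len = (size_t)st.st_size;
	unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	close(fd);	/* the mapping stays valid after the fd is closed */
	if (p == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < len; off += stride)
		acc += p[off];

	munmap(p, len);
	*sum = acc;
	return 0;
}
```

Calling it with a stride equal to the page size touches each page once, which is where any change in fault-path cost would be expected to show up.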
Results after bisection (the KPI is lmbench3.MMAP.read.latency.us):
v6.15 - 97.3K
v6.16-rc1 - 107.5K
v6.16-rc3 - 107.4K
6.15.0-rc4badcommit - 103.5K
6.15.0-rc4badcommit_m1 (one commit before bad-commit) - 94.2K
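The relative change behind these figures can be checked with a small helper. This is only a minimal sketch: pct_change() is a name made up for illustration, and the inputs are the latency values listed above.

```c
/*
 * Percentage change of `value` relative to `baseline`.
 * A positive result means the latency went up, i.e. a regression.
 */
double pct_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

With the values above, pct_change(97.3, 107.5) comes to about 10.5% (v6.15 vs v6.16-rc1) and pct_change(94.2, 103.5) to about 9.9% (across the bad commit).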
I also ran the micro-benchmark under tools/perf record. Below is the
tools/perf diff output between the bad commit and the commit just
before it.
# ./perf diff perf.data.old perf.data
No kallsyms or vmlinux with build-id da8042fb274c5e3524318e5e3afbeeef5df2055e was found
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  ...........................
#
               +4.34%  [kernel.kallsyms]  [k] __lruvec_stat_mod_folio
               +3.41%  [kernel.kallsyms]  [k] unmap_page_range
               +3.33%  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
               +2.04%  [kernel.kallsyms]  [k] srso_alias_return_thunk
               +2.02%  [kernel.kallsyms]  [k] srso_alias_safe_ret
    22.22%     -1.78%  bw_mmap_rd         [.] bread
               +1.76%  [kernel.kallsyms]  [k] __handle_mm_fault
               +1.70%  [kernel.kallsyms]  [k] filemap_map_pages
               +1.58%  [kernel.kallsyms]  [k] set_pte_range
               +1.58%  [kernel.kallsyms]  [k] next_uptodate_folio
               +1.33%  [kernel.kallsyms]  [k] do_anonymous_page
               +1.01%  [kernel.kallsyms]  [k] get_page_from_freelist
               +0.98%  [kernel.kallsyms]  [k] __mem_cgroup_charge
               +0.85%  [kernel.kallsyms]  [k] asm_exc_page_fault
               +0.82%  [kernel.kallsyms]  [k] native_irq_return_iret
               +0.82%  [kernel.kallsyms]  [k] do_user_addr_fault
               +0.77%  [kernel.kallsyms]  [k] clear_page_erms
               +0.75%  [kernel.kallsyms]  [k] handle_mm_fault
               +0.73%  [kernel.kallsyms]  [k] set_ptes.isra.0
               +0.70%  [kernel.kallsyms]  [k] lru_add
               +0.69%  [kernel.kallsyms]  [k] folio_add_file_rmap_ptes
               +0.68%  [kernel.kallsyms]  [k] folio_remove_rmap_ptes
    12.45%     -0.65%  line               [.] mem_benchmark_0
               +0.64%  [kernel.kallsyms]  [k] __alloc_frozen_pages_noprof
               +0.63%  [kernel.kallsyms]  [k] vm_normal_page
               +0.63%  [kernel.kallsyms]  [k] free_pages_and_swap_cache
               +0.63%  [kernel.kallsyms]  [k] lock_vma_under_rcu
               +0.60%  [kernel.kallsyms]  [k] __rcu_read_unlock
               +0.59%  [kernel.kallsyms]  [k] cgroup_rstat_updated
               +0.57%  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
               +0.52%  [kernel.kallsyms]  [k] __mod_lruvec_state
               +0.51%  [kernel.kallsyms]  [k] exc_page_fault
Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
---
fs/ext4/ext4.h | 1 +
fs/ext4/ext4_jbd2.c | 3 ++-
fs/ext4/ialloc.c | 3 +++
fs/ext4/inode.c | 20 ++++++++++++++++++++
4 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a20e9cd7184..2fad90c30493 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2993,6 +2993,7 @@ int ext4_walk_page_buffers(handle_t *handle,
struct buffer_head *bh));
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
+bool ext4_should_enable_large_folio(struct inode *inode);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 135e278c832e..b3e9b7bd7978 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -16,7 +16,8 @@ int ext4_inode_journal_mode(struct inode *inode)
 	    ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
 	    test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
 	    (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
-	    !test_opt(inode->i_sb, DELALLOC))) {
+	    !test_opt(inode->i_sb, DELALLOC) &&
+	    !mapping_large_folio_support(inode->i_mapping))) {
 		/* We do not support data journalling for encrypted data */
 		if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
 			return EXT4_INODE_ORDERED_DATA_MODE;  /* ordered */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index e7ecc7c8a729..4938e78cbadc 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1336,6 +1336,9 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
 		}
 	}
+	if (ext4_should_enable_large_folio(inode))
+		mapping_set_large_folios(inode->i_mapping);
+
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 	err = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 29eccdf8315a..7fd3921cfe46 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4774,6 +4774,23 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
 	return -EFSCORRUPTED;
 }
+bool ext4_should_enable_large_folio(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (!S_ISREG(inode->i_mode))
+		return false;
+	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
+	    ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
+		return false;
+	if (ext4_has_feature_verity(sb))
+		return false;
+	if (ext4_has_feature_encrypt(sb))
+		return false;
+
+	return true;
+}
+
 struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 			  ext4_iget_flags flags, const char *function,
 			  unsigned int line)
@@ -5096,6 +5113,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		ret = -EFSCORRUPTED;
 		goto bad_inode;
 	}
+	if (ext4_should_enable_large_folio(inode))
+		mapping_set_large_folios(inode->i_mapping);
+
 	ret = check_igot_inode(inode, flags, function, line);
 	/*
 	 * -ESTALE here means there is nothing inherently wrong with the inode,
---
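
To make the conditions in ext4_should_enable_large_folio() above easier to reason about, here is a userspace model of the same decision. It is a sketch only: struct inode_model and should_enable_large_folio_model() are hypothetical names, and each boolean stands in for the kernel check named in its comment.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the inode and superblock state. */
struct inode_model {
	bool is_regular_file;		/* S_ISREG(inode->i_mode) */
	bool mount_data_journal;	/* DATA_FLAGS == EXT4_MOUNT_JOURNAL_DATA */
	bool inode_journal_data;	/* EXT4_INODE_JOURNAL_DATA flag set */
	bool sb_has_verity;		/* ext4_has_feature_verity(sb) */
	bool sb_has_encrypt;		/* ext4_has_feature_encrypt(sb) */
};

/* Mirrors the order of the checks in the patch. */
bool should_enable_large_folio_model(const struct inode_model *m)
{
	if (!m->is_regular_file)
		return false;
	if (m->mount_data_journal || m->inode_journal_data)
		return false;
	if (m->sb_has_verity)
		return false;
	if (m->sb_has_encrypt)
		return false;

	return true;
}
```

Note that the verity and encrypt checks are made against the superblock feature flags, not the individual inode, so in this model a plain file on a filesystem with either feature enabled does not get large folios.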
Thanks and Regards,
Suneeth D

Steps to run lmbench3
1. git clone https://github.com/intel/lmbench.git
2. git clone https://github.com/intel/lkp-tests.git
3. cd lmbench
4. git apply ../lkp-tests/programs/lmbench3/pkg/lmbench3.patch
5. make
6. sed -i '/lat_pagefault -P no/i [ -f no ] || dd if=/dev/zero of=no count=1 bs=1G' bin/x86_64-linux-gnu/lmbench
7. (
echo 1
echo 1
echo 10240
echo development
echo no
echo no
echo no
echo no
echo no
echo yes
echo no
echo no
echo no
echo no
echo no
echo no
echo no
echo no
echo no
echo no
echo no
echo yes
echo
echo
echo
[ 1 -eq 1 ] && echo
echo no
) | make results
8. cd results/ && make