Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group

From: Baokun Li
Date: Fri Jul 18 2025 - 21:54:58 EST


On 2025/7/17 21:36, Ojaswin Mujoo wrote:
On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Hi Baokun,

So i just got curious of the difference between smp_load_acquire vs
READ_ONCE on PowerPC, another weak memory ordering arch.
Interestingly, I didn't see that big of a single threaded drop.

The number are as follows (mb_opt_scan=1):

100 threads
w/ smp_load_acquire 1668 MB/s
w/ READ_ONCE 1599 MB/s

1 thread pinned to 1 cpu
w/ smp_load_acquire 292 MB/s
w/ READ_ONCE 296 MB/s

Either ways, this is much better than the base which is around 500MB/s
but just thought I'd share it here

Thank you for providing the test data for PowerPC, it is true that
the results may vary slightly between architectures.


Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>

Thank you for you review!

Cheers,
Baokun

Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@xxxxxxx>
Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
---
fs/ext4/ext4.h | 2 +-
fs/ext4/mballoc.c | 12 +++---------
2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b83095541c98..7f5c070de0fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
unsigned int s_mb_group_prealloc;
unsigned int s_max_dir_size_kb;
/* where last allocation was done - for stream allocation */
- unsigned long s_mb_last_group;
+ ext4_group_t s_mb_last_group;
unsigned int s_mb_prefetch;
unsigned int s_mb_prefetch_limit;
unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e3a5103e1620..025b759ca643 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
ac->ac_buddy_folio = e4b->bd_buddy_folio;
folio_get(ac->ac_buddy_folio);
/* store last allocated for subsequent stream allocation */
- if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
- spin_lock(&sbi->s_md_lock);
- sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
- spin_unlock(&sbi->s_md_lock);
- }
+ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+ WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
/*
* As we've just preallocated more space than
* user requested originally, we store allocated
@@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
/* if stream allocation is enabled, use global goal */
if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
- /* TBD: may be hot point */
- spin_lock(&sbi->s_md_lock);
- ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
- spin_unlock(&sbi->s_md_lock);
+ ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
ac->ac_g_ex.fe_start = -1;
ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
}
--
2.46.1