Hi Ojaswin,

On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
On 2025/5/28 22:53, Ojaswin Mujoo wrote:
On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
From: Baokun Li <libaokun1@xxxxxxxxxx>

Hi Baokun,

<...>

|--------|--------|--------|--------|--------|--------|--------|--------|
|   -    |   1    |   2    |   4    |   8    |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
|  base  | 295287 |  70665 |  33865 |  19387 |  10104 |   5588 |   3588 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 |  90653 |  60344 |  35302 |  23280 |
|        |  -3.0% | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 |  61727 |  29702 |  16845 |  10430 |
|ize_scan|  -0.9% | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|

Hey Baokun, nice improvements! The proposed changes make sense to me;
however, I suspect the performance improvements may come at the cost of a
slight increase in fragmentation, which might especially affect rotational
disks. Comparing e2freefrag numbers with and without the patches might
give better insight into this.

While this approach might slightly increase free space fragmentation on
the disk, it significantly reduces file fragmentation, leading to faster
read speeds on rotational disks.
When multiple processes contend for free blocks within the same block
group, the probability of blocks allocated by the same process being
merged on consecutive allocations is low. This is because other processes
may preempt the free blocks immediately following the current process's
last allocated region.
Normally, we rely on preallocation to avoid files becoming overly
fragmented (even though preallocation itself can cause fragmentation in
free disk space). But since fallocate doesn't support preallocation, the
fragmentation issue is more pronounced. Counterintuitively, skipping busy
groups actually boosts opportunities for file extent merging, which in turn
reduces overall file fragmentation.
Referencing will-it-scale/fallocate2, I tested 64 processes, each appending
4KB via fallocate to one of 64 separate files until each file reached 1GB.
This test specifically examines contention in block allocation; unlike
fallocate2, it omits the contention between fallocate and truncate.
Preliminary results are provided below; detailed scripts and full test
outcomes are attached in the email footer.
----------------------------------------------------------
                     |      base       |     patched     |
---------------------|--------|--------|--------|--------|
mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw(MiB/s)            |  217   |  219   |  5685  |  5670  |
Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744 | 2084 |
Avg. size per extent | 4 | 4 | 1408 | 503 |
Fragmentation score | 100 | 100 | 2 | 6 |
----------------------------------------------------------
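For reference, the workload just described can be sketched in scaled-down form. This is a hypothetical Python stand-in for the author's actual scripts (which are in the email footer), using os.posix_fallocate in place of the fallocate calls; the process count and sizes here are shrunk placeholders, not the benchmark parameters:

```python
import os
import tempfile
from multiprocessing import Process

# Scaled-down stand-ins for the real test's 64 processes, 4 KiB appends,
# and 1 GiB per-file target.
NPROC = 4
APPEND = 4 * 1024
TARGET = 64 * 1024

def append_until_target(path):
    """Grow one file by APPEND bytes per fallocate call until TARGET."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        offset = 0
        while offset < TARGET:
            # Each call extends the file at its current end, mimicking the
            # append-mode fallocate loop in the described test.
            os.posix_fallocate(fd, offset, APPEND)
            offset += APPEND
    finally:
        os.close(fd)

def run(directory):
    """Run NPROC concurrent writers, one file each; return the file sizes."""
    paths = [os.path.join(directory, f"file{i}") for i in range(NPROC)]
    procs = [Process(target=append_until_target, args=(p,)) for p in paths]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [os.path.getsize(p) for p in paths]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run(d))
```

Scaling NPROC to 64 and TARGET to 1 GiB on an ext4 mount reproduces the shape of the test; the published numbers of course come from the author's scripts, not this sketch.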
Thanks for the info and data, and apologies for being late; I caught a
viral infection last week :/
These numbers look pretty interesting, and your explanation of why the
fragmentation is better makes sense. It is definitely a win-win then for
performance and fragmentation!

Yes.

Regardless, the performance benefits are significant and I feel it is
good to have these patches.

Okay, thank you for your feedback.

I'll give my reviews individually as I'm still going through patch 2.
However, I wanted to check on a couple of things:

1. I believe you ran these in docker. Would you have any script etc. open
sourced that I can use to run some benchmarks on my end (and also to
understand your test setup)?

Yes, these two patches primarily mitigate contention between block
allocations, and between block allocation and release. The testing script
can be referenced from the fio script mentioned earlier in the email
footer. You can also add more truncate calls based on it.

Thanks for the scripts.
2. I notice we are getting way less throughput in mb_optimize_scan? I
wonder why that is the case. Do you have some data on that? Are your
tests starting on an empty FS? Maybe in that case linear scan works a
bit better, since almost all groups are empty. If so, what are the
numbers like when we start with a fragmented FS?

The throughput of mb_optimize_scan is indeed much lower, and we continue
to optimize it, as mb_optimize_scan is the default mount option and
performs better in scenarios with large-volume disks and high space usage.

Disk space used is about 7%; mb_optimize_scan should perform better with
less free space. However, this isn't the critical factor. The poor
throughput here is due to the following reasons.

One reason is that mb_optimize_scan's list traversal is unordered and
always selects the first group.

While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_lock_group impossible. This can lead to a "bouncing"
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

Hmm, right, the non-ordered traversal has led to other issues as well in
the past.
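To make the "bouncing" concrete, here is a toy userspace sketch (my own illustration, not the kernel code): a try-lock scan that restarts from the list head keeps hitting the same busy group, while a scan that simply skips busy groups makes forward progress. Python locks and a free-block array stand in for the group locks and block groups:

```python
from threading import Lock

NGROUPS = 8
locks = [Lock() for _ in range(NGROUPS)]         # stand-ins for group locks
free_blocks = [0, 0, 0, 0, 1, 1, 1, 1]          # only later groups have space

def pick_restart(max_retries=3):
    """List-style scan: a failed try-lock forces a restart from the head,
    so one busy group near the head makes every retry bounce off it."""
    for _ in range(max_retries):
        for i in range(NGROUPS):
            if not locks[i].acquire(blocking=False):
                break  # bounce: give up and restart from group 0
            try:
                if free_blocks[i]:
                    return i
            finally:
                locks[i].release()
    return -1

def pick_linear():
    """Linear-style scan: a failed try-lock just skips that group."""
    for i in range(NGROUPS):
        if not locks[i].acquire(blocking=False):
            continue  # skip the busy group and keep moving forward
        try:
            if free_blocks[i]:
                return i
        finally:
            locks[i].release()
    return -1

# Simulate another task holding group 0's lock:
locks[0].acquire()
print(pick_restart())  # -1: every retry bounces off busy group 0
print(pick_linear())   # 4: group 0 is skipped; first group with space wins
```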
Another reason is that opt_scan tends to allocate from groups that have
just received freed blocks, causing it to constantly "jump around"
between certain groups.

This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.

By "just received freed blocks", you mean the blocks got freed in the
group, right?
I was under the impression that when we free blocks and the group's
largest order/avg fragment size changes, the group is added to the end of
the free fragment list/order list, so it should be the last to be picked.
Is that not the case?

Yes, we are indeed adding the group to the list tail. However, because
linear involves more groups in allocation, the journal becomes a
bottleneck. If opt_scan first attempts to traverse block groups to the
right from the target group in all lists, and then from index 0 to the
left in all lists, competition between block groups would be
significantly reduced.

To enable ordered traversal, we attempted to convert the list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.

Do you maybe have some code you can share for this?

Yes, V2 will include those changes.
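If I understand the proposed ordering correctly, the traversal order can be sketched with a small hypothetical helper (my interpretation, not the actual V2 code):

```python
def ordered_groups(goal, ngroups):
    """Yield group indices starting at the goal group: first scan right to
    the last group, then wrap around and scan from group 0 up to the goal.
    An ordered structure keyed by group number (e.g. an xarray) makes this
    resumable at any index, instead of restarting from an unordered list's
    head on every retry."""
    for i in range(goal, ngroups):
        yield i
    for i in range(goal):
        yield i

print(list(ordered_groups(5, 8)))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

With every allocator starting at its own goal group and moving rightward, two allocators with different goals mostly touch disjoint group ranges, which is where the reduced inter-group competition would come from.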
However, xarray traversal introduces overhead; list_head group selection
was O(1), while xarray becomes O(n log n). This results in a ~10%
performance drop in single-process scenarios, and I'm not entirely sure
this trade-off is worthwhile. 🤔

Additionally, by attempting to merge before inserting in
ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
during merges, resulting in roughly a 5% performance gain.
- Or maybe it is that the lazyinit thread has not yet initialized all
the buddies, which means we have fewer BGs in the freefrag list or the
order list used by the faster CRs. Hence, if they are locked, we fall
back more to CR_GOAL_LEN_SLOW. To check if this is the case, one hack
is to cat /proc/fs/ext4/<disk>/mb_groups (or something along those
lines) before the benchmark, which forces init of all the group
buddies, thus populating all the lists used by mb_opt_scan. Maybe we
can check if this gives better results.

All groups are already initialized at the time of testing, so that's not
where the problem lies.
3. Also, how much IO are we doing here? Are we filling the whole FS?

In a single container, create a file, then repeatedly append 8KB using
fallocate until the file reaches 1MB. After that, truncate the file to
0 and continue appending 8KB with fallocate. The 64 containers will
occupy at most 64MB of disk space in total, so they won't fill the
entire file system.
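The per-container loop above can be sketched as follows (sizes taken from the description; os.posix_fallocate stands in for the fallocate calls, and the temporary file is a placeholder for the container's test file):

```python
import os
import tempfile

def one_round(fd, target=1 << 20, step=8 * 1024):
    """Append `step` bytes via fallocate until the file reaches `target`
    (1 MiB by default), then truncate it back to zero, as each container
    in the described test does repeatedly. Returns the peak size."""
    size = os.fstat(fd).st_size
    while size < target:
        os.posix_fallocate(fd, size, step)
        size += step
    os.ftruncate(fd, 0)
    return size

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(one_round(f.fileno()))  # 1048576
```

Running this loop forever in 64 containers keeps allocation and release contending on the same small set of block groups, which matches the contention pattern the patches target.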
Also, as per your theory, if we do similar experiments in a fragmented
FS, we should see opt_scan perform better, right? I'd like to test this
as well; I'll try to play with the scripts you have shared.

Yes, mb_optimize_scan performs well when free space fragmentation is
high.