Hi Ojaswin,

On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
On 2025/5/28 22:53, Ojaswin Mujoo wrote:
On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
From: Baokun Li <libaokun1@xxxxxxxxxx>

Hi Baokun,

<...>

|--------|--------|--------|--------|--------|--------|--------|--------|
|   -    |   1    |   2    |   4    |   8    |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
|  base  | 295287 |  70665 |  33865 |  19387 |  10104 |   5588 |   3588 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 |  90653 |  60344 |  35302 |  23280 |
|        |  -3.0% | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 |  61727 |  29702 |  16845 |  10430 |
|ize_scan|  -0.9% | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|

Hey Baokun, nice improvements! The proposed changes make sense to me;
however, I suspect the performance improvements may come at the cost of a
slight increase in fragmentation, which might especially affect rotational
disks. Comparing e2freefrag numbers with and without the patches might
give better insight into this.

While this approach might slightly increase free space fragmentation on
the disk, it significantly reduces file fragmentation, leading to faster
read speeds on rotational disks.
When multiple processes contend for free blocks within the same block
group, the probability of blocks allocated by the same process being
merged on consecutive allocations is low. This is because other processes
may preempt the free blocks immediately following the current process's
last allocated region.
Normally, we rely on preallocation to avoid files becoming overly
fragmented (even though preallocation itself can cause fragmentation in
free disk space). But since fallocate doesn't support preallocation, the
fragmentation issue is more pronounced. Counterintuitively, skipping busy
groups actually boosts opportunities for file extent merging, which in turn
reduces overall file fragmentation.
Referencing will-it-scale/fallocate2, I tested 64 processes, each appending
4KB via fallocate to one of 64 separate files until each file reached 1GB.
This test specifically examines contention in block allocation; unlike
fallocate2, it omits the contention between fallocate and truncate.
Preliminary results are provided below; detailed scripts and full test
outcomes are attached in the email footer.
----------------------------------------------------------
                     |      base       |     patched     |
---------------------|--------|--------|--------|--------|
mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw(MiB/s)            |  217   |  219   |  5685  |  5670  |
Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744 | 2084 |
Avg. size per extent | 4 | 4 | 1408 | 503 |
Fragmentation score | 100 | 100 | 2 | 6 |
----------------------------------------------------------
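For reference, the workload just described can be sketched in scaled-down form. This is a hypothetical Python stand-in for the author's actual scripts (which are in the email footer), using os.posix_fallocate in place of the fallocate calls; the process count and sizes here are shrunk placeholders, not the benchmark parameters:

```python
import os
import tempfile
from multiprocessing import Process

# Scaled-down stand-ins for the real test's 64 processes, 4 KiB appends,
# and 1 GiB per-file target.
NPROC = 4
APPEND = 4 * 1024
TARGET = 64 * 1024

def append_until_target(path):
    """Grow one file by APPEND bytes per fallocate call until TARGET."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        offset = 0
        while offset < TARGET:
            # Each call extends the file at its current end, mimicking the
            # append-mode fallocate loop in the described test.
            os.posix_fallocate(fd, offset, APPEND)
            offset += APPEND
    finally:
        os.close(fd)

def run(directory):
    """Run NPROC concurrent writers, one file each; return the file sizes."""
    paths = [os.path.join(directory, f"file{i}") for i in range(NPROC)]
    procs = [Process(target=append_until_target, args=(p,)) for p in paths]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [os.path.getsize(p) for p in paths]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run(d))
```

Scaling NPROC to 64 and TARGET to 1 GiB on an ext4 mount reproduces the shape of the test; the published numbers of course come from the author's scripts, not this sketch.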
Thanks for the info and data, and apologies for being late; I caught a
viral infection last week :/
These numbers look pretty interesting, and your explanation of why the
fragmentation is better makes sense. It is definitely a win-win then for
performance and fragmentation!

Yes.

Regardless, the performance benefits are significant and I feel it is
good to have these patches.

Okay, thank you for your feedback.

I'll give my reviews individually as I'm still going through patch 2.
However, I wanted to check on a couple of things:

1. I believe you ran these in docker. Would you have any script etc. open
sourced that I can use to run some benchmarks on my end (and also to
understand your test setup)?

Yes, these two patches primarily mitigate contention between block
allocations, and between block allocation and release. The testing script
can be referenced from the fio script mentioned earlier in the email
footer. You can also add more truncate calls based on it.

Thanks for the scripts.
2. I notice we are getting way less throughput in mb_optimize_scan? I
wonder why that is the case. Do you have some data on that? Are your
tests starting on an empty FS? Maybe in that case linear scan works a
bit better, since almost all groups are empty. If so, what are the
numbers like when we start with a fragmented FS?

The throughput of mb_optimize_scan is indeed much lower, and we continue
to optimize it, as mb_optimize_scan is the default mount option and
performs better in scenarios with large-volume disks and high space usage.

Disk space used is about 7%; mb_optimize_scan should perform better with
less free space. However, this isn't the critical factor. The poor
throughput here is due to the following reasons.

One reason is that mb_optimize_scan's list traversal is unordered and
always selects the first group.

While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_lock_group impossible. This can lead to a "bouncing"
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

Hmm, right, the non-ordered traversal has led to other issues as well in
the past.
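To make the "bouncing" concrete, here is a toy userspace sketch (my own illustration, not the kernel code): a try-lock scan that restarts from the list head keeps hitting the same busy group, while a scan that simply skips busy groups makes forward progress. Python locks and a free-block array stand in for the group locks and block groups:

```python
from threading import Lock

NGROUPS = 8
locks = [Lock() for _ in range(NGROUPS)]         # stand-ins for group locks
free_blocks = [0, 0, 0, 0, 1, 1, 1, 1]          # only later groups have space

def pick_restart(max_retries=3):
    """List-style scan: a failed try-lock forces a restart from the head,
    so one busy group near the head makes every retry bounce off it."""
    for _ in range(max_retries):
        for i in range(NGROUPS):
            if not locks[i].acquire(blocking=False):
                break  # bounce: give up and restart from group 0
            try:
                if free_blocks[i]:
                    return i
            finally:
                locks[i].release()
    return -1

def pick_linear():
    """Linear-style scan: a failed try-lock just skips that group."""
    for i in range(NGROUPS):
        if not locks[i].acquire(blocking=False):
            continue  # skip the busy group and keep moving forward
        try:
            if free_blocks[i]:
                return i
        finally:
            locks[i].release()
    return -1

# Simulate another task holding group 0's lock:
locks[0].acquire()
print(pick_restart())  # -1: every retry bounces off busy group 0
print(pick_linear())   # 4: group 0 is skipped; first group with space wins
```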
Another reason is that opt_scan tends to allocate from groups that have
just received freed blocks, causing it to constantly "jump around"
between certain groups.

This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.

By "just received freed blocks", you mean the blocks got freed in the
group, right?
I was under the impression that when we free blocks and the group's
largest order/avg fragment size changes, the group is added to the end of
the free fragment list/order list, so it should be the last to be picked.
Is that not the case?

Yes, we are indeed adding the group to the list tail. However, because
linear involves more groups in allocation, the journal becomes a
bottleneck. If opt_scan first attempts to traverse block groups to the
right from the target group in all lists, and then from index 0 to the
left in all lists, competition between block groups would be
significantly reduced.

To enable ordered traversal, we attempted to convert the list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.

Do you maybe have some code you can share for this?

Yes, V2 will include those changes.
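If I understand the proposed ordering correctly, the traversal order can be sketched with a small hypothetical helper (my interpretation, not the actual V2 code):

```python
def ordered_groups(goal, ngroups):
    """Yield group indices starting at the goal group: first scan right to
    the last group, then wrap around and scan from group 0 up to the goal.
    An ordered structure keyed by group number (e.g. an xarray) makes this
    resumable at any index, instead of restarting from an unordered list's
    head on every retry."""
    for i in range(goal, ngroups):
        yield i
    for i in range(goal):
        yield i

print(list(ordered_groups(5, 8)))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

With every allocator starting at its own goal group and moving rightward, two allocators with different goals mostly touch disjoint group ranges, which is where the reduced inter-group competition would come from.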
However, xarray traversal introduces overhead; list_head group selection
was O(1), while xarray becomes O(n log n). This results in a ~10%
performance drop in single-process scenarios, and I'm not entirely sure
this trade-off is worthwhile. 🤔

Additionally, by attempting to merge before inserting in
ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
during merges, resulting in roughly a 5% performance gain.
- Or maybe it is that the lazyinit thread has not yet initialized all
the buddies, which means we have fewer BGs in the freefrag list or the
order list used by the faster CRs. Hence, if they are locked, we fall
back more to CR_GOAL_LEN_SLOW. To check if this is the case, one hack
is to cat /proc/fs/ext4/<disk>/mb_groups (or something along those
lines) before the benchmark, which forces init of all the group
buddies, thus populating all the lists used by mb_opt_scan. Maybe we
can check if this gives better results.

All groups are already initialized at the time of testing, so that's not
where the problem lies.
3. Also, how much IO are we doing here? Are we filling the whole FS?

In a single container, create a file, then repeatedly append 8KB using
fallocate until the file reaches 1MB. After that, truncate the file to
0 and continue appending 8KB with fallocate. The 64 containers will
occupy at most 64MB of disk space in total, so they won't fill the
entire file system.
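The per-container loop above can be sketched as follows (sizes taken from the description; os.posix_fallocate stands in for the fallocate calls, and the temporary file is a placeholder for the container's test file):

```python
import os
import tempfile

def one_round(fd, target=1 << 20, step=8 * 1024):
    """Append `step` bytes via fallocate until the file reaches `target`
    (1 MiB by default), then truncate it back to zero, as each container
    in the described test does repeatedly. Returns the peak size."""
    size = os.fstat(fd).st_size
    while size < target:
        os.posix_fallocate(fd, size, step)
        size += step
    os.ftruncate(fd, 0)
    return size

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(one_round(f.fileno()))  # 1048576
```

Running this loop forever in 64 containers keeps allocation and release contending on the same small set of block groups, which matches the contention pattern the patches target.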
Also, as per your theory, if we do similar experiments in a fragmented
FS, we should see opt_scan perform better, right? I'd like to test this
as well; I'll try to play with the scripts you have shared.

Yes, mb_optimize_scan performs well when free space fragmentation is
high.