Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation

From: Baokun Li
Date: Thu Jun 12 2025 - 07:31:15 EST


On 2025/6/11 16:22, Ojaswin Mujoo wrote:
On Tue, Jun 10, 2025 at 09:48:30PM +0800, Baokun Li wrote:
On 2025/6/10 20:06, Ojaswin Mujoo wrote:
On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
On 2025/5/28 22:53, Ojaswin Mujoo wrote:
On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
----------------------------------------------------------
                     |       base      |      patched    |
---------------------|--------|--------|--------|--------|
mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744    | 2084   |
Avg. size per extent | 4      | 4      | 1408   | 503    |
Fragmentation score  | 100    | 100    | 2      | 6      |
----------------------------------------------------------
Hi Baokun,

Thanks for the info and data and apologies for being late, I caught a
viral last week :/
Hi Ojaswin,

Being sick really takes a toll.
Please get some good rest and take care of yourself.
Thanks Baokun!

<snip>

Another reason is that opt_scan tends to allocate from groups that have
just received freed blocks, causing it to constantly "jump around"
between certain groups.

This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.
By just received free blocks, you mean the blocks got freed in the group
right?
Yes.
I was under the impression that when we free blocks and the group's
largest order/ avg fragment changes, the group is added to the end of
the free fragment list/order list so it should be the last to be picked.
Is that not the case?
Yes, we are indeed adding the group to the list tail. However, because
we traverse all ordered lists from low to high, a group might end up
"bouncing" repeatedly between order_0 and order_1 here.

For instance, if order_1 only contains group 1, linear traversal would
rarely revisit it after the initial pass. However, after a non-linear
allocation, this group is moved from the order_1 list to the order_0 list.
When blocks are freed, it's moved back to the order_1 list, and then
non-linear traversal prioritizes allocation in this same group again...
Right, I get what you mean now. Thanks.

However, because linear involves more groups in allocation, journal
becomes a bottleneck. If opt_scan first attempts to traverse block groups
to the right from the target group in all lists, and then from index 0 to
the left in all lists, competition between block groups would be
significantly reduced.

To enable ordered traversal, we attempted to convert list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.
Do you maybe have some code you can share of this?
Yes, V2 will include those changes.
Okay sure

<snip

In a single container, create a file, then repeatedly append 8KB using
fallocate until the file reaches 1MB. After that, truncate the file to
0 and continue appending 8KB with fallocate. The 64 containers will
occupy a maximum of 64MB of disk space in total, so they won't fill the
entire file system.
Also, as per your theory, if we do similar experiments in a fragmented
FS we should see opt_scan perform better right? I'd like to test this as
well. I'll try to play with the scripts you have shared.

Yes, mb_optimize_scan performs well when free space fragmentation is
severe. We have a 600TB disk array where the write bandwidth is
approximately 5 GB/s when empty. When space utilization reaches 95%,
linear traversal drops bandwidth to 1 GB/s, whereas enabling
mb_optimize_scan restores it to 4.2 GB/s.
Got it, thanks for confirming. Seems like in mostly empty FS linear
traversal seems to do better. Makes me wonder if we should use some
heurestic to switch between linear and opt_scan based allocation, for
example opt_scan can be used if FS is 60% full or has a fragmentation
score of X. (or something like that..)
Yes, it would be nice if we could switch at the right point in time so we
could have the best of both allocations. But we need some data to support
us in choosing the right strategy, but that can come later.
But Im also curious about your improved optimize scan implementation
with ordered traversal, so lets see how that goes first.

The v2 version of the patch is currently being tested and if it goes
well, it could be sent out next week.


Cheers,
Baokun