On Tue, Jun 10, 2025 at 09:48:30PM +0800, Baokun Li wrote:Yes, it would be nice if we could switch at the right point in time so we
On 2025/6/10 20:06, Ojaswin Mujoo wrote:Thanks Baokun!
On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:Hi Ojaswin,
On 2025/5/28 22:53, Ojaswin Mujoo wrote:Hi Baokun,
On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:----------------------------------------------------------
| base | patched |
---------------------|--------|--------|--------|--------|
mb_optimize_scan | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw(MiB/s) | 217 | 219 | 5685 | 5670 |
Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744 | 2084 |
Avg. size per extent | 4 | 4 | 1408 | 503 |
Fragmentation score | 100 | 100 | 2 | 6 |
----------------------------------------------------------
Thanks for the info and data and apologies for being late, I caught a
viral last week :/
Being sick really takes a toll.
Please get some good rest and take care of yourself.
<snip>
Right, I get what you mean now. Thanks.Yes.Another reason is that opt_scan tends to allocate from groups that haveBy just received free blocks, you mean the blocks got freed in the group
just received freed blocks, causing it to constantly "jump around"
between certain groups.
This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.
right?
I was under the impression that when we free blocks and the group'sYes, we are indeed adding the group to the list tail. However, because
largest order/ avg fragment changes, the group is added to the end of
the free fragment list/order list so it should be the last to be picked.
Is that not the case?
we traverse all ordered lists from low to high, a group might end up
"bouncing" repeatedly between order_0 and order_1 here.
For instance, if order_1 only contains group 1, linear traversal would
rarely revisit it after the initial pass. However, after a non-linear
allocation, this group is moved from the order_1 list to the order_0 list.
When blocks are freed, it's moved back to the order_1 list, and then
non-linear traversal prioritizes allocation in this same group again...
Okay sureYes, V2 will include those changes.However, because linear involves more groups in allocation, journalDo you maybe have some code you can share of this?
becomes a bottleneck. If opt_scan first attempts to traverse block groups
to the right from the target group in all lists, and then from index 0 to
the left in all lists, competition between block groups would be
significantly reduced.
To enable ordered traversal, we attempted to convert list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.
<snip
Got it, thanks for confirming. Seems like in mostly empty FS linearYes, mb_optimize_scan performs well when free space fragmentation isIn a single container, create a file, then repeatedly append 8KB usingAlso, as per your theory, if we do similar experiments in a fragmented
fallocate until the file reaches 1MB. After that, truncate the file to
0 and continue appending 8KB with fallocate. The 64 containers will
occupy a maximum of 64MB of disk space in total, so they won't fill the
entire file system.
FS we should see opt_scan perform better right? I'd like to test this as
well. I'll try to play with the scripts you have shared.
severe. We have a 600TB disk array where the write bandwidth is
approximately 5 GB/s when empty. When space utilization reaches 95%,
linear traversal drops bandwidth to 1 GB/s, whereas enabling
mb_optimize_scan restores it to 4.2 GB/s.
traversal seems to do better. Makes me wonder if we should use some
heurestic to switch between linear and opt_scan based allocation, for
example opt_scan can be used if FS is 60% full or has a fragmentation
score of X. (or something like that..)
But Im also curious about your improved optimize scan implementationThe v2 version of the patch is currently being tested and if it goes
with ordered traversal, so lets see how that goes first.