Hello!
On Fri 30-05-25 17:31:48, Baokun Li wrote:
On 2025/5/29 20:56, Jan Kara wrote:
On Fri 23-05-25 16:58:19, libaokun@xxxxxxxxxxxxxxx wrote:
From: Baokun Li <libaokun1@xxxxxxxxxx>
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_[group|start]) when allocating.
Moreover, when allocating data blocks, if the first try (goal allocation)
fails and stream allocation is on, it tries a global goal starting from
the last group we used (s_mb_last_group). This can make things faster by
writing blocks close together on the disk. But when many processes are
allocating, they all fight over s_md_lock and might even try to use the
same group. This makes it harder to merge extents and can make files more
fragmented. If different processes allocate chunks of very different sizes,
the free space on the disk can also get fragmented. A small allocation
would fit in a partially filled group, but a larger allocation may have
skipped that group and moved the shared goal past it, so the small IO
ends up in a mostly empty group instead.
So, we're changing stream allocation to work per inode. First, it tries
the goal, then the last group where that inode successfully allocated a
block. This keeps an inode's data closer together. Plus, after moving
mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
block allocation anymore because we already have the write lock on
i_data_sem. This gets rid of the contention between allocating and
releasing blocks, which gives a huge performance boost to fallocate2.
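To make the before/after concrete, here is a rough userspace sketch of the
locking change being described (my own simplification for illustration, not
the patch itself; only the names s_md_lock, s_mb_last_group and i_data_sem
come from the text above):

	/*
	 * Simplified userspace model.  Before: every allocation takes the
	 * shared s_md_lock to read the streaming goal.  After: the goal
	 * lives in the inode and is covered by the i_data_sem-style lock
	 * the allocation path already holds for writing.
	 */
	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>

	struct sb_model {
		pthread_spinlock_t s_md_lock;	/* also covers the freed-blocks list */
		uint32_t s_mb_last_group;	/* shared streaming goal */
	};

	struct inode_model {
		pthread_rwlock_t i_data_sem;	/* held for write by the caller */
		uint32_t i_mb_last_group;	/* per-inode streaming goal */
	};

	/* before: shared goal, one more contended lock round-trip per allocation */
	static uint32_t next_goal_shared(struct sb_model *sb)
	{
		uint32_t group;

		pthread_spin_lock(&sb->s_md_lock);
		group = sb->s_mb_last_group;
		pthread_spin_unlock(&sb->s_md_lock);
		return group;
	}

	/* after: per-inode goal, no extra lock since i_data_sem is already held */
	static uint32_t next_goal_per_inode(struct inode_model *inode)
	{
		return inode->i_mb_last_group;
	}

	int main(void)
	{
		struct sb_model sb = { .s_mb_last_group = 7 };
		struct inode_model inode = { .i_mb_last_group = 3 };

		pthread_spin_init(&sb.s_md_lock, PTHREAD_PROCESS_PRIVATE);
		pthread_rwlock_init(&inode.i_data_sem, NULL);

		pthread_rwlock_wrlock(&inode.i_data_sem);	/* as the write path would */
		printf("shared goal %u, per-inode goal %u\n",
		       next_goal_shared(&sb), next_goal_per_inode(&inode));
		pthread_rwlock_unlock(&inode.i_data_sem);
		return 0;
	}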
Performance test data follows:
CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.
                      base     patched
mb_optimize_scan=0    6755     23280 (+244.6%)
mb_optimize_scan=1    4302     10430 (+142.4%)
Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
Good spotting with the s_md_lock contention here. But your changes don't
quite make sense to me. The idea of streaming allocation in mballoc is to
have an area of filesystem for large files to reduce fragmentation. When
you switch to per-inode, this effect of packing large files together goes
away. Furthermore, for each inode either all allocations will be very likely
streaming or not streaming (the logic uses file size) so either your
per-inode target will be unused or just another constantly used copy of
goal value.
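For reference, the file-size test referred to here is, if I remember
correctly, the s_mb_stream_request threshold check in
ext4_mb_group_or_file(); a loose sketch of the idea (not a verbatim copy of
the kernel code):

	/*
	 * Loose sketch of the "streaming or not" decision: requests for
	 * files above a size threshold use stream allocation, smaller ones
	 * use the group (locality group) preallocation, so a given inode
	 * tends to stay on one side of the threshold for all allocations.
	 */
	static int toy_use_stream_alloc(unsigned long size_blocks,
					unsigned long stream_threshold)
	{
		/* in ext4 the threshold is s_mb_stream_request, 16 blocks by default */
		return size_blocks > stream_threshold;
	}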
Sorry, I didn't intend to break streaming allocation's semantics.

A precise definition of streaming allocation is not found in the
existing code, documentation, or historical records. Consequently,
my previous understanding of it was somewhat inaccurate.
I previously thought it was used to optimize the efficiency of linear
traversal. For instance, if 500 inodes are created in group 0 and each
file is sequentially filled to 1GB, then, since each file starts out
empty, its goal would be group 1 (the second group in the inode's flex_bg).
Without a global goal, and in the absence of non-linear traversal, after
the first inode is filled, the second inode would have to scan past
groups 1 through 8, which are now full, before finding its first free block.
This inefficiency escalates, eventually requiring the 500th inode to
potentially traverse almost 4000 block groups to find its first available
block.
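(As a rough sanity check of those numbers, assuming 4KiB blocks and the
default 32768 blocks per group, each block group covers 128MiB, so a 1GiB
file spans about 8 groups; 500 such files then cover roughly 500 * 8 = 4000
groups, which is where the "almost 4000 block groups" estimate comes from.)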
I see. But doesn't ext4_mb_choose_next_group() usually select group from
which allocation can succeed instead of linearly scanning through all the
groups? The linear scan is just a last resort as far as I remember.
Anyway, I'm not 100% sure what was the original motivation for the streaming goal.
Maybe Andreas would remember since he was involved in the design. What I
wrote is mostly derived from the general understanding of mballoc operating
principles but I could be wrong.

Hey Andreas, do you happen to remember what the initial thinking was?
I initially believed it could be split to the inode level to reduce
traversal time and file fragmentation. However, as you pointed out,
its purpose is to cluster large files, not data blocks within a file.
Given this, splitting it to the inode level no longer makes sense.

Yes, that's right. I was referring to how streaming allocation worked.
So I can see two sensible solutions here:
a) Drop streaming allocations support altogether.
As mentioned above, it can also greatly improve the efficiency of linear
traversal, so we can't simply remove it.
b) Enhance streaming allocation support to avoid contention between
processes allocating in parallel and freeing. Frankly, there's no strong
reason why reads & writes of streaming allocation goal need to use a
spinlock AFAICS.
Yes, since it's just a hint, we don't need a lock at all. We don't even
need fe_start; we just need the last fe_group.
We could just store a physical block number and use
atomic64 accessors for it? Also having single goal value is just causing
more contention on group locks for parallel writers that end up using it
(that's the problem I suspect you were hitting the most).
Spot on! We did try a single, lockless atomic64 variable, and just as
you pointed out, all processes started traversing from the very same
group, which just cranked up the contention, dropping OPS to just 6745.
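For illustration, the kind of lockless hint being discussed could look
roughly like this (a userspace sketch using C11 atomics, not the code that
was actually tested):

	/*
	 * Single lockless streaming-goal hint (C11 atomics standing in for
	 * the kernel's atomic/READ_ONCE helpers).  No lock is needed since
	 * it is only a hint, but with one global slot every writer reads
	 * back the same group and they all pile into it, which matches the
	 * contention described above.
	 */
	#include <stdatomic.h>
	#include <stdint.h>

	static _Atomic uint32_t last_group_hint;

	static uint32_t stream_goal_read(void)
	{
		return atomic_load_explicit(&last_group_hint, memory_order_relaxed);
	}

	static void stream_goal_update(uint32_t group)
	{
		atomic_store_explicit(&last_group_hint, group, memory_order_relaxed);
	}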
So perhaps we
can keep multiple streaming goal slots in the superblock (scale the count
based on CPU count & filesystem group count) and just pick the slot based
on inode number hash to reduce contention?
Honza
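If it helps, here is a rough sketch of the multi-slot idea above (my own
illustration with an arbitrary slot count, not a proposed patch):

	/*
	 * Several lockless hint slots instead of one, picked by a hash of
	 * the inode number so unrelated writers usually land on different
	 * slots.  In a real implementation the slot count would be scaled
	 * from the CPU count and the number of block groups, as suggested
	 * above, and the kernel would likely use something like hash_64().
	 */
	#include <stdatomic.h>
	#include <stdint.h>

	#define GOAL_SLOTS 16	/* arbitrary for the sketch; power of two */

	static _Atomic uint32_t last_group_hint[GOAL_SLOTS];

	static unsigned int goal_slot(uint64_t ino)
	{
		return (ino * 0x9E3779B97F4A7C15ULL) >> 60;	/* top 4 bits -> 0..15 */
	}

	static uint32_t stream_goal_read(uint64_t ino)
	{
		return atomic_load_explicit(&last_group_hint[goal_slot(ino)],
					    memory_order_relaxed);
	}

	static void stream_goal_update(uint64_t ino, uint32_t group)
	{
		atomic_store_explicit(&last_group_hint[goal_slot(ino)], group,
				      memory_order_relaxed);
	}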
That's a brilliant idea, actually!

Since most containers are CPU-pinned, this would naturally cluster a single
container's data blocks together. I believe we can also apply this to inode
allocation, so a container's inodes and data are all in a single region,
significantly reducing interference between containers.
My gratitude for your valuable suggestion!
I'm going to try out the CPU bucketing approach.

Cool, let's see how it works out :).