Baokun Li <libaokun1@xxxxxxxxxx> writes:

> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
>
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
>
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.

It seems this makes block allocation non-deterministic and dependent on
the system load. I can see where this could cause problems when
reproducing bugs at least, but perhaps also in other cases.
Better perhaps just round robin the groups?
Or at least add a way to turn it off.
-Andi
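
For readers following the thread, here is a rough userspace sketch of the two
policies being compared: skipping a group whose lock is busy, versus plain
round robin. It is only an illustration under stated assumptions. The struct
and function names are hypothetical, pthread mutexes stand in for the
per-group lock, and none of this is the ext4 code or the patch itself.

```c
/* Userspace illustration only; all names here are hypothetical. */
#include <pthread.h>
#include <stddef.h>

struct group {
	pthread_mutex_t lock;		/* stand-in for the per-group lock */
	unsigned int free_blocks;
};

/*
 * Try-lock policy: walk the candidates in order and skip any group whose
 * lock is currently held. Which group is returned depends on what other
 * threads hold at that moment, i.e. on system load.
 * Returns the index of a group that is now locked by the caller, or -1.
 */
static int pick_group_trylock(struct group *g, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (pthread_mutex_trylock(&g[i].lock) != 0)
			continue;		/* busy, move on */
		if (g[i].free_blocks > 0)
			return (int)i;		/* caller unlocks when done */
		pthread_mutex_unlock(&g[i].lock);
	}
	return -1;
}

/*
 * Round-robin policy: the starting group comes from a counter, not from
 * lock state, but the caller may block on a contended lock.
 * The counter is not thread safe here; a real version would need an
 * atomic or per-CPU cursor.
 */
static int pick_group_round_robin(struct group *g, size_t n,
				  unsigned int *next)
{
	for (size_t tried = 0; tried < n; tried++) {
		size_t i = (*next)++ % n;

		pthread_mutex_lock(&g[i].lock);	/* may wait */
		if (g[i].free_blocks > 0)
			return (int)i;
		pthread_mutex_unlock(&g[i].lock);
	}
	return -1;
}
```

With the try-lock version, two callers asking at the same time end up in
different groups without waiting, but which group each one gets depends on
timing. With the round-robin version the order is fixed by the counter, at
the cost of possibly sleeping on a lock someone else holds.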