On Mon 23-06-25 15:32:51, Baokun Li wrote:
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (s_mb_last_group) when allocating.
Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we no longer need s_md_lock to keep the two
consistent, and we can ensure the s_mb_last_group read is up to date
by using smp_store_release()/smp_load_acquire().
Besides, s_mb_last_group only requires ext4_group_t (i.e., unsigned int),
so its current unsigned long type is superfluous.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
                   | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
Disk: 960GB SSD    |-------------------------|-------------------------|
                   | base  |     patched     | base  |     patched     |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 4821  | 7612 (+57.8%)   | 15371 | 21647 (+40.8%)  |
mb_optimize_scan=1 | 4784  | 7568 (+58.1%)   | 6101  | 9117 (+49.4%)   |
Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5cdae3bda072..3f103919868b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
         ac->ac_buddy_folio = e4b->bd_buddy_folio;
         folio_get(ac->ac_buddy_folio);
         /* store last allocated for subsequent stream allocation */
-        if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-                spin_lock(&sbi->s_md_lock);
-                sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-                spin_unlock(&sbi->s_md_lock);
-        }
+        if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+                /* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
+                smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

On 2025/6/28 2:19, Jan Kara wrote:
Do you really need any kind of barrier (implied by smp_store_release())
here? I mean the store to s_mb_last_group is perfectly fine to be reordered
with other accesses from the thread, isn't it? As such it should be enough
to have WRITE_ONCE() here...
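
For concreteness, the suggestion amounts to something like the sketch
below. The load side is assumed to mirror the counterpart hunk in
ext4_mb_regular_allocator() (not shown in this quote), so treat the
exact field usage there as illustrative:

/* store side, in ext4_mb_use_best_found() */
if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
        WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

/* load side, assumed shape in ext4_mb_regular_allocator() */
ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
ac->ac_g_ex.fe_start = -1;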

On Mon 30-06-25 11:48:20, Baokun Li wrote:
WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
that variable reads/writes access values directly from L1/L2 cache rather
than registers.

On 2025/6/30 15:47, Jan Kara wrote:
I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
particular they force the compiler to read / write the memory location
exactly once instead of reading it potentially multiple times in different
parts of an expression and getting inconsistent values, or possibly writing
the value say byte by byte (yes, that would be insane but not contrary to
the C standard).
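
As a sketch of that hazard (the helper and its names are assumptions
for illustration, not ext4 code):

/* Hypothetical helper: without READ_ONCE(), nothing stops the
 * compiler from rematerializing 'grp' by loading the shared field
 * twice, so the check and the return can see different values. */
static ext4_group_t next_goal_group(struct ext4_sb_info *sbi,
                                    ext4_group_t ngroups)
{
        ext4_group_t grp = sbi->s_mb_last_group;

        if (grp >= ngroups)     /* the bounds check may use one load ... */
                grp = 0;
        return grp;             /* ... and this a second, newer load     */
}

/* Forcing exactly one load closes that hole: */
ext4_group_t grp = READ_ONCE(sbi->s_mb_last_group);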

On Mon 30-06-25 17:21:48, Baokun Li wrote:
READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
two main purposes:
1. It tells the compiler that the variable's value can change unexpectedly,
   preventing the compiler from making incorrect optimizations based on
   assumptions about its stability.
2. It ensures the CPU directly reads from or writes to the variable's
   memory address. This means the value will be fetched from cache (L1/L2)
   if available, or from main memory otherwise, rather than using a stale
   value from a CPU register.
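
Reduced to their core (a simplified sketch; the in-tree macros in
include/asm-generic/rwonce.h add compile-time type checking on top),
the volatile cast is what implements both points:

/* Simplified sketch of the kernel macros. */
#define READ_ONCE(x)    (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)                              \
do {                                                    \
        *(volatile typeof(x) *)&(x) = (val);            \
} while (0)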

On 2025/7/1 0:32, Jan Kara wrote:
Yes, we agree on this.

On Mon 30-06-25 11:48:20, Baokun Li wrote:
They do not guarantee that other CPUs see the latest values. Reading stale
values could lead to more useless traversals, which might incur higher
overhead than memory barriers. This is why we use memory barriers to ensure
the latest values are read.

On 2025/6/30 15:47, Jan Kara wrote:
But smp_load_acquire() / smp_store_release() have no guarantee about CPUs
seeing latest values either. They are just speculation barriers, meaning
they prevent the CPU from reordering accesses in the code after
smp_load_acquire() to be performed before the smp_load_acquire() is
executed and similarly with smp_store_release(). So I dare to say that
these barriers have no (positive) impact on the allocation performance and
just complicate the code - but if you have some data that show otherwise,
I'd be happy to be proven wrong.
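
(In userspace C11 terms - memory_order_release/acquire being the
analogues of these kernel primitives - the guarantee is exactly the
message-passing ordering in this runnable sketch, and nothing about
how quickly the flag becomes visible:)

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;                /* payload, plain accesses */
static atomic_int flag;         /* publication flag        */

static void *writer(void *arg)
{
        data = 42;                                      /* plain store */
        atomic_store_explicit(&flag, 1,
                              memory_order_release);    /* "smp_store_release()" */
        return NULL;
}

static void *reader(void *arg)
{
        /* Acquire does not make flag == 1 visible any sooner; the
         * loop may legally spin on a stale 0 for a while. */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
                ;
        /* ...but once flag == 1 is seen, data == 42 is guaranteed. */
        assert(data == 42);
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        puts("ok");
        return 0;
}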

On Mon 30-06-25 17:21:48, Baokun Li wrote:
smp_load_acquire() / smp_store_release() guarantee that CPUs read the
latest data.
For example, imagine a variable a = 0, with both CPU0 and CPU1 having
a=0 in their caches.
Without a memory barrier:
When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
a=1 is written to main memory and becomes visible to other CPUs.
Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
its invalidation queue. However, it might not process it immediately;
instead, it could perform the read first, potentially still reading a=0
from its cache.
With a memory barrier:
When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
the store buffer, but data in the store buffer is also written to main
memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
When CPU1 executes smp_load_acquire(&a), it receives the RFO and adds it
to its invalidation queue. Here, the invalidation queue is flushed, which
invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
performs the read, its cache is invalid, so it reads the latest a=1 from
main memory.

On 2025/7/1 0:32, Jan Kara wrote:
Well, here I think you assume way more about the CPU architecture than is
generally true (and I didn't find what you write above guaranteed either
by x86 or by arm64 CPU documentation). Generally I'm following the
guarantees as defined by Documentation/memory-barriers.txt, and there you
can argue only about the order of effects as observed by different CPUs,
but not really about when content is fetched to / from CPU caches.
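
(For reference, the store-buffering pattern below, in C11 atomics as a
userspace analogue of these primitives rather than the kernel macros
themselves, is the classic case where even release/acquire leaves both
sides free to read the stale 0:)

#include <stdatomic.h>

/* Store-buffering (SB): with release stores and acquire loads, the
 * outcome *r0 == 0 && *r1 == 0 is still permitted. Release/acquire
 * order the surrounding accesses; they do not force either CPU to
 * observe the other's most recent store. */

static atomic_int a, b;

void cpu0(int *r0)
{
        atomic_store_explicit(&a, 1, memory_order_release);
        *r0 = atomic_load_explicit(&b, memory_order_acquire);
}

void cpu1(int *r1)
{
        atomic_store_explicit(&b, 1, memory_order_release);
        *r1 = atomic_load_explicit(&a, memory_order_acquire);
}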

On Mon 30-06-25 17:21:48, Baokun Li wrote:
Explaining why smp_load_acquire() and smp_store_release() guarantee the
latest data is read truly requires delving into their underlying
implementation details.
I suggest you Google "why memory barriers are needed." You might find
introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
MESI, store buffers, and invalidate queue, along with the stories behind
them.

On 2025/7/1 0:32, Jan Kara wrote:
Yes, I know these things. Not that I'd be really an expert in them but I'd
call myself familiar enough :). But that is kind of beside the point here.
What I want to point out is that if you have code like:
some access A
grp = smp_load_acquire(&sbi->s_mb_last_group)
some more accesses
then the CPU is fully within its rights to execute them as:
grp = smp_load_acquire(&sbi->s_mb_last_group)
some access A
some more accesses
Now your *particular* ARM64 CPU implementation may never do that, just as
no current x86 CPU does, but some other CPU implementation may (e.g. an
Alpha CPU probably would, as irrelevant as that is these days :). So using
smp_load_acquire() is at best a heuristic that may happen to provide a
fresher value on some CPU models, but it isn't guaranteed to help on all
architectures and all CPU models Linux supports.

On Tue 01-07-25 10:39:53, Baokun Li wrote:
Yes, it's true that the underlying implementation of [...]

So can you do me a favor please and do a performance comparison of using
READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
your Arm64 server for streaming goal management? If smp_load_acquire /
smp_store_release indeed bring any performance benefit for your servers, we
can just stick a comment there explaining why they are used. If they bring
no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
simplicity. Do you agree?
Honza