RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
From: Wlodarczyk, Bertrand
Date: Fri Jul 04 2025 - 09:14:18 EST
> > > Also the response to the tearing issue explained by JP is not satisfying.
> >
> > In other words, the claim is: "it's better to stall other cpus in a
> > spinlock plus disable IRQs every time in order to serve an outdated snapshot instead of providing the user with the freshest statistics much, much faster".
> > In terms of statistics, the freshest data served fast to the user is, in my opinion, the better behavior.
>
> > This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
>
> Yes, that's a second possibility to solve the problem.
> I chose the atomics approach because, in my opinion, incremental statistics are a somewhat natural use case for them.
> They're good for individual counters but I'm not sure they're natural fit for a group of stats.
It depends on what we consider a group of stats.
In the case of the rstat cpu stats, the statistics are printed together but counted independently.
The only exception is sum_exec_runtime, whose purpose in rstat is unknown to me.
From what I can observe and read in the code (in the rstat context; I know about its usage in sched), it's just the sum of stime and utime.
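For reference, the "group" being discussed is roughly the following (paraphrased from memory of include/linux/cgroup-defs.h; the exact field set may differ between kernel versions):

    struct cgroup_base_stat {
            struct task_cputime cputime;    /* utime, stime, sum_exec_runtime */
    #ifdef CONFIG_SCHED_CORE
            u64 forceidle_sum;              /* core-sched force-idle time */
    #endif
            /* newer kernels carry further independent counters here */
    };

Each field is bumped independently; only sum_exec_runtime overlaps with the others, which is the point above.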
> A series of atomic ops can be significantly more expensive than locked updates and it also comes with problems like split updates as discussed in this thread. I think most of resistance is from the use of atomics.
I just can't see what the issue with split updates is.
To illustrate, let's do a thought experiment.
Assume we have three stats A, B and C, all initialized to 0, with execution starting at time unit 0.
We have two CPUs: one wants to perform a read and the other a write at the same time.
The cost is 16 time units per update operation and 2 per read.
Let's start the read with a 16-unit delay and the write with no delay.
Lock scenario:

Init: A = 0, B = 0, C = 0

cpu 0 - write {
    lock
    A = 1
    B = 1
    C = 1
    unlock
} // 48 units elapsed

cpu 1 - read, delayed 16 units {
    spins 32 units
    lock
    read(A) // 1
    read(B) // 1
    read(C) // 1
    unlock
}

Result is A = 1, B = 1, C = 1 after 54 units of time elapsed.
Atomic scenario:

Init: A = 0, B = 0, C = 0

cpu 0 - write {
    A = 1 // 16 units elapsed
    B = 1 // 32 units elapsed
    C = 1 // 48 units elapsed
}

cpu 1 - read, delayed 16 units {
    read(A) // A = 1 // 18 units elapsed
    read(B) // B = 0 // 20 units elapsed
    read(C) // C = 0 // 22 units elapsed
}

Result is A = 1, B = 0, C = 0 after 22 units of time elapsed.

cpu 1 - read, delayed 48 units {
    read(A) // A = 1 // 50 units elapsed
    read(B) // B = 1 // 52 units elapsed
    read(C) // C = 1 // 54 units elapsed
}

Result is A = 1, B = 1, C = 1 after 54 units of time elapsed.
The difference is that with atomics you have access to the fresher state of statistic A at time unit 18.
After 54 units both solutions show the same result.
So what's the issue here? Why is a user seeing A = 1, B = 0, C = 0 at unit 22 (instead of spinning) a bad thing in the rstat scenario?
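The same thought experiment in code, as a minimal kernel-style sketch (a, b, c are hypothetical counters for illustration, not the actual rstat fields):

    #include <linux/atomic.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    /* atomic variant: the reader may observe a "split" update */
    static atomic64_t a, b, c;

    static void writer_atomic(void)
    {
            atomic64_set(&a, 1);    /* a reader running here ...           */
            atomic64_set(&b, 1);    /* ... sees a = 1, b = 0, c = 0,       */
            atomic64_set(&c, 1);    /* i.e. a fresher but partial snapshot */
    }

    static void reader_atomic(u64 *ra, u64 *rb, u64 *rc)
    {
            *ra = atomic64_read(&a);
            *rb = atomic64_read(&b);
            *rc = atomic64_read(&c);
    }

    /*
     * locked variant: the reader always sees either the old or the new
     * triple, but has to spin until the writer drops the lock.
     */
    static DEFINE_SPINLOCK(stat_lock);
    static u64 la, lb, lc;

    static void writer_locked(void)
    {
            spin_lock(&stat_lock);
            la = 1;
            lb = 1;
            lc = 1;
            spin_unlock(&stat_lock);
    }

    static void reader_locked(u64 *ra, u64 *rb, u64 *rc)
    {
            spin_lock(&stat_lock);
            *ra = la;
            *rb = lb;
            *rc = lc;
            spin_unlock(&stat_lock);
    }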
Moreover, I consider the lock solution inferior due to its complexity.
For example, ss_rstat_lock can currently return one of two locks:
the global rstat_base_lock or a per-subsystem lock from cgroup_subsys; the latter is initialized in ss_rstat_init.
When? Why do we acquire the global lock during some flushes and, in the happier scenario, the cgroup_subsys lock?
How do you generalize a performance solution here? The amount of state and scope to keep in mind when crafting a solution is larger than with atomics.
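For context, the dispatch in question looks roughly like this (paraphrased from memory of kernel/cgroup/rstat.c in the for-6.17 branch; names may not be exact):

    static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
    {
            /* subsystems with their own rstat lock (set up in
             * ss_rstat_init()) get the per-subsystem lock ... */
            if (ss)
                    return &ss->rstat_ss_lock;

            /* ... everything else falls back to the global base lock */
            return &rstat_base_lock;
    }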
Even now, the mainline implementation still has race conditions. I know these are being worked on right now, but it's a sign of the complexity that atomics essentially remove.
The patch removes more code than it adds and has much better resistance to stress than the lock.
The code handling the 32-bit __u64_stats_fetch path is gone because it's no longer needed.
It makes everything simpler to reason about, the scopes are easy to grasp, it opens opportunities for further improvements in submodules, and it clears all the races detected by KCSAN.
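As a rough sketch of what "no longer needed" means (not the literal diff; the atomic-side field names are only illustrative):

    /* today's per-cpu read side: seqcount-protected snapshot, needed
     * mainly so 32-bit machines don't tear the 64-bit counters */
    do {
            seq = __u64_stats_fetch_begin(&rstatc->bsync);
            delta = rstatc->bstat;
    } while (__u64_stats_fetch_retry(&rstatc->bsync, seq));

    /* with atomic64_t counters the snapshot loop disappears and each
     * statistic is simply read on its own */
    delta.cputime.utime = atomic64_read(&rstatc->bstat.cputime.utime);
    delta.cputime.stime = atomic64_read(&rstatc->bstat.cputime.stime);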
> Can you please try a different approach?
Over the last few days I've investigated this and had some success, but nowhere near the improvements yielded by the use of atomics.
For the reasons I mentioned above, the lock approach is much more complex to optimize.
> > I wouldn't be addressing this issue if there were no customers
> > affected by rstat latency in multi-container multi-cpu scenarios.
>
> > Out of curiosity, can you explain the case that you observed in more detail?
> > What was the customer doing?
>
> Single hierarchy, hundreds of containers on one server, multiple independent owners.
> Some of them want to have current stats available in their web GUI.
> They are hammering the stats for their cgroups.
> The server experiences inefficiencies; perf shows a visible percentage of CPU cycles spent in cgroup_rstat_flush.
>
> I prepared a benchmark which can serve as an example of the issue faced by the customer:
> https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
>
> qemu vm:
> +--------------+---------+---------+
> |   mean (s)   |8dcb0ed8 | patched |
> +--------------+---------+---------+
> |cpu, KCSAN on |16.13*   |3.75     |
> +--------------+---------+---------+
> |cpu, KCSAN off|4.45     |0.81     |
> +--------------+---------+---------+
> *race condition still present
>
> It's not hammering the lock as much as the previous stressor, so the results are better for the for-6.17 branch.
> The customer operates at a much bigger scale than the 4 cgroups in the benchmark.
> There are workarounds implemented, so it's not that hot now (for them).
> Anyway, I think it's worth trying to improve the scalability situation,
> especially since, as far as I can see, there are no downsides.
>
> There are also reports about similar problems in the memory rstats, but I didn't look at them yet.
> Yeah, I saw the benchmark but I was more curious what actual use case would lead to behaviors like that because you'd have to hammer on those stats really hard for this to be a problem. In most use cases that I'm aware of, the polling frequencies of these stats are >= 1sec. I guess the users in your use case were banging on them way harder, at least previously.
From what I know, the https://github.com/google/cadvisor instances deployed on the client machines hammered these stats.
Sharing servers between independent teams or orgs in big corporations is common, and every interested party deployed its own (or a similar) instance.
We can say "just don't do that" and be fine, but it will keep happening anyway. It's better to just make rstat more robust.