The kernel currently faces scalability issues when multiple userspace
programs attempt to read cgroup statistics concurrently.
The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
which serializes all reads and updates of a css's statistics,
preventing multiple CPUs from making progress in parallel.
Given that rstat operates on a per-CPU basis and only aggregates
statistics in the parent cgroup, there is no compelling reason
why these statistics cannot be atomic.
By eliminating the lock, each CPU can traverse its rstat hierarchy
independently, without blocking. Synchronization is achieved during
parent propagation through atomic operations.
[..]
This change significantly improves performance in scenarios
where multiple CPUs access CPU rstat within a single cgroup hierarchy,
yielding a roughly 50x speedup over the mainline version.
Notably, performance for memory and I/O rstats remains unchanged,
as these are managed in separate submodules.
Additionally, this patch addresses a race condition detectable
in the current mainline by KCSAN in __cgroup_account_cputime,
which occurs when attempting to read a single hierarchy
from multiple CPUs.
Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@xxxxxxxxx>
---
Benchmark code: https://gist.github.com/bwlodarcz/c955b36b5667f0167dffcff23953d1da
Tested on Intel(R) Xeon(R) Platinum 8468V, 2s 48c 2tpc, 377GiB RAM, Fedora 41:
+--------+-------+
|Mainline|Patched|
+--------+-------+
|369.95s |6.52s |
+--------+-------+
@@ -820,7 +813,6 @@ struct cgroup_subsys {
*/
unsigned int depends_on;
- spinlock_t rstat_ss_lock;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cbeaa499a96a..36af2b883440 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,6 @@
#include <trace/events/cgroup.h>
-static DEFINE_SPINLOCK(rstat_base_lock);
static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -37,14 +36,6 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
}
-static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
-{
- if (ss)
- return &ss->rstat_ss_lock;
-
- return &rstat_base_lock;
-}