[PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

From: Leonardo Bras
Date: Wed Jan 25 2023 - 02:36:43 EST


Disclaimer:
a - The cover letter got bigger than expected, so I had to split it into
sections to better organize myself. I am not very comfortable with it.
b - The performance numbers below do not include patch 5/5 (Remove flags
from memcg_stock_pcp), which could further improve performance for
drain_all_stock(), but I only noticed that optimization at the last
minute.


0 - Motivation:
In the current codebase, when drain_all_stock() runs, it schedules a
drain_local_stock() for each cpu that has a percpu stock associated with a
descendant of the given root_memcg.

This happens even on 'isolated cpus', a feature commonly used by workloads that
are sensitive to interruptions and context switching, such as vRAN and Industrial
Control Systems.

Since this scheduling behavior is a problem for those workloads, the proposal is
to replace the current local_lock + schedule_work_on() solution with a per-cpu
spinlock.
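
To make the idea concrete, here is a rough, simplified sketch of how
drain_all_stock() changes; this is not the actual diff, and
stock_matches_memcg() is a placeholder for the mem_cgroup_is_descendant()
check done on the stock's cached memcg:

/* Upstream: ask each matching cpu to drain its own stock. */
for_each_online_cpu(cpu) {
	struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

	if (stock_matches_memcg(stock, root_memcg))
		schedule_work_on(cpu, &stock->work);	/* wakes up isolated cpus */
}

/* Patched: drain every matching stock from the cpu running drain_all_stock(). */
for_each_online_cpu(cpu) {
	struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
	unsigned long flags;

	spin_lock_irqsave(&stock->stock_lock, flags);
	if (stock_matches_memcg(stock, root_memcg))
		drain_stock(stock);			/* no remote wakeup */
	spin_unlock_irqrestore(&stock->stock_lock, flags);
}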

1 - Possible performance losses:
With a low number of schedule_work_on() calls, local_locks are supposed to
perform better than spinlocks, given the cacheline is always accessed by a
single CPU and there is no contention. The impact on those areas is analyzed
below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remotely reads the percpu
memcg_stock cacheline of every online CPU, and remotely writes to all cpus that
pass the mem_cgroup_is_descendant() test. (stock->flags, stock->work)

With spinlocks, drain_all_stock() also remotely reads the percpu memcg_stock
cacheline of every online CPU, and remotely writes to all cpus that pass the
mem_cgroup_is_descendant() test. (stock->stock_lock, on top of the above)
While the spinlock may require extra acquire/release writes on some archs, they
will all happen on a cacheline already held exclusively, so the overhead is
small.

In both cases, the next local cpu read will require a fetch from memory (as the
cacheline was invalidated by the remote write), and the next local write will
require regaining cacheline exclusivity.

So regarding cacheline usage, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() does not run often. If that were
not the case, there would be a lot of scheduled tasks, which would kill cpu
performance.

Since it does not run often, and it is the only function that accesses remote
percpu memcg_stock, contention should not be significant, and should cause
less impact than scheduling the (remote) work and running the scheduled work
(locally).


2 - Performance test results:
2.1 - Test System:
- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed
in 8 NUMA nodes: 0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
- 256GB RAM, 2x32GB per NUMA node
- For cpu isolation: use kernel cmdline:
isolcpus=8-15,72-79 nohz_full=8-15,72-79
- Memcg group created with:
    [Slice]
    MemoryAccounting=true
    MemoryLimit=1G
    MemoryMax=1G
    MemorySwapMax=1M
- For pinning runs on a given CPU, I used 'taskset -c $cpunum command'
- For restarting the memcg, the following was used:
    restart_memcg() {
        systemctl stop test.slice
        systemctl daemon-reload
        systemctl start test.slice
    }

2.2 - Impact on functions that use memcg_stock:
2.2.1 - Approach: Using perf or tracepoints gives a very coarse result, so
it was preferred to count the total cpu clocks between entering and
exiting the functions that use the memcg_stock, on an isolated cpu.
Something like this was used on x86:

fun_name()
{
	u64 clk = rdtsc_ordered();

	/* ... function does its job ... */

	clk = rdtsc_ordered() - clk;
	/* dbg points to a percpu debug struct, sketched below */
	dbg->fun_name.cycles += clk;
	dbg->fun_name.count++;
}
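
The percpu debug structure referenced above would look something like this;
the exact layout was not posted, so the names below (dbg_stats, fun_stats)
are illustrative assumptions:

/* Illustrative sketch only: names and layout are assumed, not from the patch. */
struct fun_stats {
	u64	cycles;
	u64	count;
};

struct dbg_stats {
	struct fun_stats	consume_stock;
	struct fun_stats	refill_stock;
	struct fun_stats	mod_objcg_state;
	struct fun_stats	consume_obj_stock;
	struct fun_stats	refill_obj_stock;
	struct fun_stats	drain_all_stock;
};

static DEFINE_PER_CPU(struct dbg_stats, dbg_stats);

Inside each instrumented function, dbg would then be obtained with
this_cpu_ptr(&dbg_stats).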

The percpu statistics were then acquired via a char device after the
test finished, and an average function clock usage was calculated.

For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
of 1000000 iterations inside the memcg for each cpu tested:

    for cpu in $cpuset; do
        systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
            for k in {1..100000} ; do
                cat /proc/interrupts > /dev/null;
            done"
    done

For the drain_all_stock() test, it was necessary to restart the memcg
(or cause an OOM) to call the function.

2.2.2 - Results
For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls.
STDDEV is the standard deviation of the per-run averages over the 6 runs,
and Call Count is the sum of calls over the 6 runs:

Patched                Average clk    STDDEV           Call Count
consume_stock:         63.75983475    0.1610502136       72167768
refill_stock:          67.45708322    0.09732816852      23401756
mod_objcg_state:       98.03841384    1.491628532       181292961
consume_obj_stock:     63.2543456     0.04624513799      94846454
refill_obj_stock:      78.56390025    0.3160306174       91732973

Upstream               Average clk    STDDEV           Call Count
consume_stock:         53.51201046    0.05200824438      71866536
refill_stock:          65.46991584    0.1178078417       23401752
mod_objcg_state:       84.95365055    1.371464414       181292707
consume_obj_stock:     60.03619438    0.05944582207      94846327
refill_obj_stock:      73.23757912    1.161933856        91732787

Patched - Upstream     Diff (cycles)  Diff % (vs Upstream)
consume_stock:         10.24782429    19.15051258
refill_stock:           1.987167378    3.035237411
mod_objcg_state:       13.08476328    15.40223781
consume_obj_stock:      3.218151217    5.360351785
refill_obj_stock:       5.326321123    7.272661368

So on average the above patched functions are 2~13 clock cycles slower
than upstream.

On the other hand, drain_all_stock() is faster on the patched version,
even considering it does all the draining itself instead of scheduling the
work on other CPUs:

drain_all_stock
cpus     Upstream        Patched         Diff (cycles)   Diff (%)
1        44331.10831     38978.03581     -5353.072507    -12.07520567
8        43992.96512     39026.76654     -4966.198572    -11.2886198
128      156274.6634     58053.87421     -98220.78915    -62.85138425

(8 cpus being in the same NUMA node)

2.3 - Contention numbers
2.3.1 - Approach
On top of the patched version, I replaced the spin_lock_irqsave() in the
functions that use the memcg_stock with spin_lock_irqsave_cc(), which
is defined as:

#define spin_lock_irqsave_cc(l, flags)					\
	do {								\
		if (!spin_trylock_irqsave(l, flags)) {			\
			u64 clk = rdtsc_ordered();			\
			spin_lock_irqsave(l, flags);			\
			clk = rdtsc_ordered() - clk;			\
			pr_err("mydebug: cpu %d hit contention :"	\
			       " fun = %s, clk = %llu\n",		\
			       smp_processor_id(), __func__, clk);	\
		}							\
	} while (0)

So in case of contention (the trylock failing), it records an approximate
clock count spent waiting for the lock, and prints it to dmesg.
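
For illustration (the instrumented hunks themselves are not part of this
cover letter), the substitution in the per-cpu functions looks roughly like
this, shown here for consume_stock(); treat it as a sketch, not the posted
diff:

	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
	unsigned long flags;

	spin_lock_irqsave_cc(&stock->stock_lock, flags);	/* was spin_lock_irqsave() */
	/* ... unchanged consume_stock() logic ... */
	spin_unlock_irqrestore(&stock->stock_lock, flags);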

For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
of 1000000 iterations for each of the 128 cpus inside the memcg
(limit set to 20G):

    for cpu in {1..128}; do
        restart_memcg
        systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
            for k in {1..100000} ; do
                cat /proc/interrupts > /dev/null;
            done"
    done

This loop was repeated for over 8 hours.

2.3.2 - Results

Function               # calls
consume_stock:         15078323802
refill_stock:           2495995683
mod_objcg_state:       39765559905
consume_obj_stock:     19882888224
refill_obj_stock:      21025241793
drain_all_stock:               592

Contentions hit: 0

(Other, more aggressive synthetic tests were also run, and even in those cases
contention was hit just a couple of times over several hours.)

2.4 - Syscall time measure
2.4.1 - Approach
To measure the patchset's effect on syscall time, the following code was
used (copied/adapted from
https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@xxxxxxxxxx/ ):

################
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high) unsigned long low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high) unsigned long long val
#define EAX_EDX_VAL(val, low, high) (val)
#define EAX_EDX_ARGS(val, low, high) "A" (val)
#define EAX_EDX_RET(val, low, high) "=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
	DECLARE_ARGS(val, low, high);

	/* cpuid serializes before rdtsc; it also clobbers ebx/ecx */
	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) : : "ebx", "ecx");

	return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000000
#define NRSLEEPS 100000

#define page_mmap() mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
#define page_munmap(x) munmap(x, 4096)

int main(int argc, char *argv[])
{
	cycles_t a, b, cycles;
	int i, syscall = 0;
	int *page;

	page = page_mmap();
	if (page == MAP_FAILED)
		perror("mmap");

	if (page_munmap(page))
		perror("munmap");

	if (argc != 2) {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	rdtscll(a);

	if (strncmp("idle", argv[1], 4) == 0)
		syscall = 0;
	else if (strncmp("syscall", argv[1], 7) == 0)
		syscall = 1;
	else {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	if (syscall == 1) {
		for (i = 0; i < NRSYSCALLS; i++) {
			page = page_mmap();
			if (page == MAP_FAILED)
				perror("mmap");

#ifdef MY_WRITE
			/* touch the page so it actually faults in */
			page[3] = i;
#endif

			if (page_munmap(page))
				perror("munmap");
		}
	} else {
		for (i = 0; i < NRSLEEPS; i++)
			usleep(10);
	}

	rdtscll(b);

	cycles = b - a;

	if (syscall == 1)
		printf("cycles / syscall: %llu\n", cycles / (NRSYSCALLS * 2));
	else
		printf("cycles / idle loop: %llu\n", cycles / NRSLEEPS);

	return 0;
}
################

Running it as './my_test syscall' makes it print the average clock cycle
usage of the syscall pair (page_mmap() and page_munmap()).

It was compiled in two versions: with -DMY_WRITE and without it.
The difference is that the former writes to the allocated page, causing it
to fault.

Each version was run 200 times, pinned to an isolated cpu, and an average
and standard deviation were calculated from those results.
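
The build and run commands are not included above; something along these
lines is assumed (compiler flags and the my_test_write binary name are
illustrative):

    gcc -O2 -o my_test my_test.c
    gcc -O2 -DMY_WRITE -o my_test_write my_test.c

    taskset -c 8 ./my_test syscall         # no_write case, on isolated cpu 8
    taskset -c 8 ./my_test_write syscall   # write case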

2.4.2 - Results
Patched:        no_write        write
AVG (cycles)    2991.195        5746.495
STDEV             27.77488427     40.55878512
STDEV %            0.9285547838    0.7058004073

Upstream:       no_write        write
AVG (cycles)    3012.22         5749.605
STDEV             25.1186451      37.26206223
STDEV %            0.8338914522    0.6480803851

Pat - Up:       no_write        write
Diff (cycles)    -21.025          -3.11
Diff %            -0.6979901866   -0.05409067232

Meaning the pair page_mmap() + page_munmap() + pagefault runs a tiny bit
faster on the patched version than on upstream.

3 - Discussion of results
In 2.2 we see that every function that uses memcg_stock on the local cpu gets
slower by a few cycles on the patched version, while the function that
accesses it remotely (drain_all_stock()) gets faster. The difference is
more accentuated as we raise the cpu count, and consequently start
dealing with memory shared across NUMA nodes.

In 2.3 we see that contention is not a big issue, as expected in 1.2.
This is probably because drain_all_stock() runs quite rarely in normal
operation, and the other functions are quite fast.

In 2.4 we can see that page_mmap() + page_munmap() + pagefault ran
a tiny bit faster. This is probably because drain_all_stock() no longer
schedules work on the running cpu, so it does not get interrupted, which
possibly makes up for the increased time in the local functions.

4 - Conclusion
Scheduling work on isolated cpus can be an issue for some workloads.
Mitigating the issue by replacing the local_lock in memcg_stock with a
spinlock should not cause much impact on performance.

Leonardo Bras (5):
mm/memcontrol: Align percpu memcg_stock to cache
mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
mm/memcontrol: Perform all stock drain in current CPU
mm/memcontrol: Remove flags from memcg_stock_pcp

mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
1 file changed, 31 insertions(+), 44 deletions(-)

--
2.39.1