Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

From: Roman Gushchin
Date: Wed Feb 08 2023 - 14:24:25 EST


On Tue, Feb 07, 2023 at 12:18:01AM -0300, Leonardo Brás wrote:
> On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> > Hi Leonardo!
>
> Hello Roman,
> Thanks a lot for replying!
>
> >
> > > Yes, but we are exchanging an "always schedule_work_on()" cost, which is a
> > > kind of contention, for a "sometimes we hit spinlock contention" cost.
> > >
> > > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > > is:
> > > 1 - wait for the spin_unlock() of a complete <percpu cache drain process>,
> > > 2 - wait for a cache hit on the local per-cpu cacheline
> > >
> > > What is currently implemented (the schedule_work_on() approach) *always*
> > > has this contention on the local cpu side:
> > > 1 - wait for a context switch,
> > > 2 - wait for a cache hit on its local per-cpu cacheline,
> > > 3 - wait for a complete <percpu cache drain process>,
> > > 4 - wait for a new context switch back to the current thread.
> >
> > I think both Michal and I are thinking of a more generic case in which the cpu
> > is not exclusively consumed by one special process, so that the draining work
> > can be executed during idle time. In this case the work is basically free.
>
> Oh, it makes sense.
> But in such scenarios, wouldn't the same apply to spinlocks?
>
> I mean, most of the contention with spinlocks only happens if the remote cpu is
> trying to drain the cache while the local cpu happens to be draining/charging,
> which is quite rare due to how fast the local cpu operations are.
>
> Also, if the cpu has some idle time, spending a little of it on possible
> spinlock contention should not be a problem. Right?
>
> >
> > And the introduction of a spin_lock() on the hot path is what we are concerned
> > about. I agree that on some hardware platforms it won't be that expensive,
> >
>
> IIRC most multicore hardware platforms supported by the kernel behave the same
> way here, since it's better to rely on cache coherence than to lock the memory
> bus.
>
> For instance, the other popular architectures supported by Linux use the LR/SC
> strategy for atomic operations (tested on ARM, POWER, RISCV), and IIRC the
> LoadReserve slow path waits for cacheline exclusivity, which this per-cpu
> structure's cacheline already has.
>
>
> > but in general not having any spinlocks is so much better.
>
> I agree that spinlocks may bring contention, which is not ideal in many cases.
> In this case, though, it may not be a big issue, because remote access to this
> structure is very rare in the usual (non-pre-OOM memcg) case.

Hi Leonardo!
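
To recap, the two draining schemes being compared look roughly like this.
This is a simplified sketch, not the actual patch: drain_stock(),
obj_stock_flush_required() and the work item exist in mm/memcontrol.c, while
the per-stock spinlock and the stock->lock field are hypothetical here.

/*
 * Current scheme: draining always runs on the target cpu via the
 * workqueue (hotplug locking omitted for brevity).
 */
static void drain_all_stock_workqueue(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

		/* The target cpu pays for a context switch into the
		 * worker, the drain itself, and a switch back. */
		if (obj_stock_flush_required(stock, root_memcg))
			schedule_work_on(cpu, &stock->work);
	}
}

/*
 * Proposed scheme: the initiator drains remote stocks itself under a
 * (hypothetical) per-stock spinlock; the local fast path contends only
 * if it happens to be charging or draining at that very moment.
 */
static void drain_all_stock_spinlock(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		unsigned long flags;

		if (obj_stock_flush_required(stock, root_memcg)) {
			spin_lock_irqsave(&stock->lock, flags);
			drain_stock(stock);
			spin_unlock_irqrestore(&stock->lock, flags);
		}
	}
}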

I made a very simple test: replaced the pcp local_lock with a spinlock and ran
a very simple kernel memory accounting benchmark (attached below) on
my desktop PC (Intel Core i9-7940X).
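
The conversion itself is not attached, but it was along these lines (a
reconstructed sketch of the fast path, not the exact diff I ran;
consume_stock() is simplified from mm/memcontrol.c):

/*
 * stock_lock becomes a spinlock instead of a local_lock, so the fast
 * path pays for an atomic RMW even with zero contention.
 */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = get_cpu_ptr(&memcg_stock);
	unsigned long flags;
	bool ret = false;

	spin_lock_irqsave(&stock->stock_lock, flags);	/* was local_lock_irqsave() */
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}
	spin_unlock_irqrestore(&stock->stock_lock, flags);
	put_cpu_ptr(&memcg_stock);

	return ret;
}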

Original (3 attempts):
81341 us
80973 us
81258 us

Patched (3 attempts):
99918 us
100220 us
100291 us

This is without any contention and without CONFIG_LOCKDEP: the patched version
is ~23% slower purely from the heavier uncontended fast path.
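
With the patch applied, the benchmark is triggered by reading the new
memory.test file of a memcg (paths below are just an example; the memory
controller must be enabled for that cgroup):

# mkdir /sys/fs/cgroup/test
# cat /sys/fs/cgroup/test/memory.test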

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49f67176a1a2..bafd3cde4507 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6563,6 +6563,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_alloc_test(struct seq_file *m, void *v)
+{
+	unsigned long i, j;
+	void **ptrs;
+	ktime_t start, end;
+	s64 delta, min_delta = LLONG_MAX;
+
+	ptrs = kvmalloc(sizeof(void *) * 1024 * 1024, GFP_KERNEL);
+	if (!ptrs)
+		return -ENOMEM;
+
+	for (j = 0; j < 100; j++) {
+		start = ktime_get();
+		for (i = 0; i < 1024 * 1024; i++)
+			ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
+		end = ktime_get();
+
+		delta = ktime_us_delta(end, start);
+		if (delta < min_delta)
+			min_delta = delta;
+
+		for (i = 0; i < 1024 * 1024; i++)
+			kfree(ptrs[i]);
+	}
+
+	kvfree(ptrs);
+	seq_printf(m, "%lld us\n", min_delta);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -6758,6 +6789,10 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "test",
+		.seq_show = memory_alloc_test,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",