Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

From: Matthew Wilcox
Date: Tue Jan 31 2023 - 00:45:43 EST


On Tue, Jan 31, 2023 at 05:23:52AM +0000, Shakeel Butt wrote:
> On Mon, Jan 30, 2023 at 04:15:09AM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> > >
> > > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: unixbench
> > > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > > with following parameters:
> > >
> > > runtime: 300s
> > > nr_task: 30%
> > > test: spawn
> > > cpufreq_governor: performance
> >
> > ...
> >
> > > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > > ---------------- ---------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 11110 -19.2% 8974 unixbench.score
> > > 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> > > 4243909 ± 6% -32.4% 2867136 ± 5% unixbench.time.major_page_faults
> > > 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> > > 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> > > 5638 +19.1% 6714 unixbench.time.system_time
> > > 5502 -20.7% 4363 unixbench.time.user_time
> >
> > So we're spending a lot more time in the kernel and correspondingly less
> > time in userspace.
> >
> > > 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> > > 46198768 -19.1% 37355723 unixbench.workload
> > > 1.365e+08 -12.5% 1.195e+08 ± 7% cpuidle..usage
> > > 1220612 ± 4% -38.0% 757009 ± 28% meminfo.Active
> > > 1220354 ± 4% -38.0% 756754 ± 28% meminfo.Active(anon)
> > > 0.50 ± 2% -0.1 0.45 ± 4% mpstat.cpu.all.soft%
> > > 1.73 -0.2 1.52 ± 2% mpstat.cpu.all.usr%
> > > 532266 -18.4% 434559 vmstat.system.cs
> > > 495826 -12.2% 435455 ± 8% vmstat.system.in
> > > 1.36e+08 -13.2% 1.18e+08 ± 9% turbostat.C1
> > > 68.80 +0.8 69.60 turbostat.C1%
> > > 1.663e+08 -12.1% 1.462e+08 ± 8% turbostat.IRQ
> > > 15.54 ± 20% -49.0% 7.93 ± 24% sched_debug.cfs_rq:/.runnable_avg.min
> > > 13.26 ± 19% -46.6% 7.08 ± 29% sched_debug.cfs_rq:/.util_avg.min
> > > 48.96 ± 8% +51.5% 74.20 ± 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> > > 138.00 ± 5% +28.9% 177.87 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> > > 228060 ± 3% +13.3% 258413 ± 4% sched_debug.cpu.avg_idle.stddev
> > > 432533 ± 5% -16.4% 361517 ± 4% sched_debug.cpu.nr_switches.min
> > > 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> > > 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> > > 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> > > 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> > > 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> > > 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> > > 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> > > 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
> >
> > So we're going off-node a lot more for ... something.
> >
> > > 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> > > 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> > > 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> > > 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> > > 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> > > 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> > > 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> > > 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> > > 304947 ± 4% -38.0% 189144 ± 28% proc-vmstat.nr_active_anon
> >
> > Umm. Are we running vmstat a lot during this test? The commit says:
> >
> > At the
> > moment the readers are either procfs interface, oom_killer and memory
> > reclaim which I think are not performance critical and should be ok with
> > slow read. However I think we can make that change in a separate patch.
> >
> > This would explain the increased cross-NUMA references (we're going to
> > the other nodes to collect the stats), and the general slowdown. But I
> > don't think it reflects a real workload; it's reflecting that the
> > monitoring of this workload that we're doing is now more accurate and
> > more expensive.
> >
>
> Thanks Willy for taking a stab at this issue. The numa_hit stat is
> updated on allocations, so I don't think stat collection would increase
> these stats. I looked at the "spawn" workload in UnixBench: it is a
> simple fork ping-pong, i.e. the process forks and then waits for the
> child, while the child simply exits.

Perhaps I was unclear. What I meant was that the monitoring done by the
01.org team might itself be what is causing the slowdown they're seeing,
since your commit says that reading the stats becomes more expensive,
and the stats they report are consistent with that.

> I ran perf and it seems like percpu counter allocation is the additional
> cost with this patch. See the report below. However, when I made spawn a
> bit more sophisticated by adding an mmap() of a GiB, the page table copy
> became the significant cost, with no difference with or without the
> given patch.
>
> I am now wondering whether this fork ping-pong is really an important
> enough workload that we should revert the patch, or whether we should
> ignore it for now and work on improving the performance of the
> __alloc_percpu_gfp code.
>
>
> - 90.97% 0.06% spawn [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> - 90.91% entry_SYSCALL_64_after_hwframe
> - 90.86% do_syscall_64
> - 80.03% __x64_sys_clone
> - 79.98% kernel_clone
> - 75.97% copy_process
> + 46.04% perf_event_init_task
> - 21.50% copy_mm
> - 10.05% mm_init
> ----------------------> - 8.92% __percpu_counter_init
> - 8.67% __alloc_percpu_gfp
> - 5.70% pcpu_alloc

5.7% of our time spent in pcpu_alloc seems excessive. Are we contending
on pcpu_alloc_mutex perhaps? Also, are you doing this on a 4-socket
machine like the kernel test robot ran on?

We could cut down the number of calls to pcpu_alloc() by a factor of 4
by having a pcpu_alloc_bulk() that would allocate all four RSS counters
at once.
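
One possible shape of that (pcpu_alloc_bulk() / percpu_counter_init_bulk()
are hypothetical names, not existing kernel API; debug-list hookup and
error paths are omitted):

```c
/*
 * Sketch only: instead of four trips through pcpu_alloc() (one per RSS
 * counter), carve all nr percpu slots out of a single allocation and
 * point each counter's ->counters at its slice.
 */
static int percpu_counter_init_bulk(struct percpu_counter *fbc, int nr,
				    gfp_t gfp)
{
	s32 __percpu *counters;
	int i;

	/* one percpu allocation sized for all nr counters */
	counters = __alloc_percpu_gfp(nr * sizeof(*counters),
				      __alignof__(*counters), gfp);
	if (!counters)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		raw_spin_lock_init(&fbc[i].lock);
		fbc[i].count = 0;
		fbc[i].counters = counters + i;
	}
	return 0;
}
```

mm_init() could then replace its per-counter loop with a single
percpu_counter_init_bulk(mm->rss_stat, NR_MM_COUNTERS, GFP_KERNEL),
taking pcpu_alloc_mutex once per fork instead of four times.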

Just throwing out ideas ...