Re: [lkp] [mm] 795ae7a0de: pixz.throughput -9.1% regression

From: Johannes Weiner
Date: Thu Jun 02 2016 - 12:09:41 EST


Hi,

On Thu, Jun 02, 2016 at 02:45:07PM +0800, kernel test robot wrote:
> FYI, we noticed pixz.throughput -9.1% regression due to commit:
>
> commit 795ae7a0de6b834a0cc202aa55c190ef81496665 ("mm: scale kswapd watermarks in proportion to memory")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: pixz
> on test machine: ivb43: 48 threads Ivytown Ivy Bridge-EP with 64G memory with following parameters: cpufreq_governor=performance/nr_threads=100%

Xiaolong, thanks for the report.

It looks like the regression stems from a change in NUMA placement:

> 3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55
> ---------------- --------------------------
> %stddev %change %stddev
> \ | \
> 78505362 ± 0% -9.1% 71324131 ± 0% pixz.throughput
> 4530 ± 0% +1.0% 4575 ± 0% pixz.time.percent_of_cpu_this_job_got
> 14911 ± 0% +2.3% 15251 ± 0% pixz.time.user_time
> 6586930 ± 0% -7.5% 6093751 ± 1% pixz.time.voluntary_context_switches
> 49869 ± 1% -9.0% 45401 ± 0% vmstat.system.cs
> 26406 ± 4% -9.4% 23922 ± 5% numa-meminfo.node0.SReclaimable
> 4803 ± 85% -87.0% 625.25 ± 16% numa-meminfo.node1.Inactive(anon)
> 946.75 ± 3% +775.4% 8288 ± 1% proc-vmstat.nr_alloc_batch
> 2403080 ± 2% -58.4% 999765 ± 0% proc-vmstat.pgalloc_dma32

The placement shift is a bit clearer in the will-it-scale report:

> 3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55
> ---------------- --------------------------
> %stddev %change %stddev
> \ | \
> 442409 ± 0% -8.5% 404670 ± 0% will-it-scale.per_process_ops
> 397397 ± 0% -6.2% 372741 ± 0% will-it-scale.per_thread_ops
> 0.11 ± 1% -15.1% 0.10 ± 0% will-it-scale.scalability
> 9933 ± 10% +17.8% 11696 ± 4% will-it-scale.time.involuntary_context_switches
> 5158470 ± 3% +5.4% 5438873 ± 0% will-it-scale.time.maximum_resident_set_size
> 10701739 ± 0% -11.6% 9456315 ± 0% will-it-scale.time.minor_page_faults
> 825.00 ± 0% +7.8% 889.75 ± 0% will-it-scale.time.percent_of_cpu_this_job_got
> 2484 ± 0% +7.8% 2678 ± 0% will-it-scale.time.system_time
> 81.98 ± 0% +8.7% 89.08 ± 0% will-it-scale.time.user_time
> 848972 ± 1% -13.3% 735967 ± 0% will-it-scale.time.voluntary_context_switches
> 19395253 ± 0% -20.0% 15511908 ± 0% numa-numastat.node0.local_node
> 19400671 ± 0% -20.0% 15518877 ± 0% numa-numastat.node0.numa_hit

The way this test is set up (in-memory compression across 48 threads),
I'm surprised we spill over at all, though, even with the higher
watermarks.
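
To put rough numbers on "higher watermarks": a back-of-the-envelope
sketch of the new __setup_per_zone_wmarks() math, with invented zone
sizes rather than anything read off this machine:

#include <stdio.h>

int main(void)
{
	/* all numbers invented for illustration, not from this machine */
	unsigned long managed_pages = 8UL * 1024 * 1024;  /* pretend ~32G zone */
	unsigned long min_wmark     = 4096;                /* pretend per-zone min */
	unsigned long scale_factor  = 10;                  /* watermark_scale_factor default */

	unsigned long old_gap = min_wmark >> 2;                        /* pre-patch: min/4 */
	unsigned long new_gap = managed_pages * scale_factor / 10000;  /* post-patch: 0.1% of zone */
	if (new_gap < old_gap)
		new_gap = old_gap;

	printf("low  = min + %lu pages (was min + %lu)\n", new_gap, old_gap);
	printf("high = min + %lu pages (was min + %lu)\n", new_gap * 2, old_gap * 2);
	return 0;
}

With guesses in that ballpark the min-to-high band grows by close to
an order of magnitude, and since IIRC the fair allocation batch is
derived from that band (high minus low), it would also fit the ~8x
jump in nr_alloc_batch above.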

Xiaolong, could you provide the full /proc/zoneinfo of that machine
right before the test runs? I wonder if it's mostly filled with
cache, and the increase in watermarks causes a higher portion of the
anon allocations and frees to spill over to the remote node, yet
never enough to enter the allocator slowpath and wake kswapd to fix
it.
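
The failure mode I have in mind, as a toy model (this is not the real
get_page_from_freelist(), just its shape):

struct zone {
	long free_pages;
	long low_wmark;
	int node;
};

/*
 * Toy fast path: walk the zonelist, skip zones below their low
 * watermark, hand out a page from the first zone that passes.
 * Nothing here wakes kswapd; that only happens once every zone
 * has failed and we fall into the slowpath.
 */
static struct zone *fastpath_alloc(struct zone **zonelist, int nr_zones)
{
	for (int i = 0; i < nr_zones; i++) {
		struct zone *z = zonelist[i];

		if (z->free_pages <= z->low_wmark)
			continue;	/* silently fall back to the next zone */
		z->free_pages--;
		return z;		/* possibly a remote node */
	}
	return NULL;			/* only now: slowpath, wake kswapd */
}

If the local zones hover just below their now much larger low
watermarks while the remote zones still pass, allocations keep
landing on the remote node and nothing ever asks kswapd to restore
the local one.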

Another suspect is the fair zone allocator, whose allocation batches
increased as well. It shouldn't affect NUMA placement, but I wonder
whether there is a bug in there that causes false spilling to foreign
nodes, bounded only by the allocation batch of the foreign zone.
Mel, does such a symptom sound familiar?
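
For reference, what I believe the fair pass is supposed to do, again
as a toy model rather than the real ALLOC_FAIR code:

struct zone {
	long alloc_batch;	/* NR_ALLOC_BATCH: this zone's share of the round */
	int node;
};

/*
 * Toy fair pass: interleave over the local zones only, skipping any
 * zone whose batch is used up.  Fairness is not supposed to extend
 * to foreign nodes; the pass should stop at the first remote zone,
 * and only once every local zone is depleted do we reset the batches
 * and retry without the fair pass.
 */
static struct zone *fair_pass(struct zone **zonelist, int nr_zones, int local_node)
{
	for (int i = 0; i < nr_zones; i++) {
		struct zone *z = zonelist[i];

		if (z->node != local_node)
			break;		/* fairness ends at the node boundary */
		if (z->alloc_batch <= 0)
			continue;	/* this zone had its share this round */
		z->alloc_batch--;
		return z;
	}
	return NULL;
}

If that node-boundary check were broken, or the preferred zone
mis-detected, we would eat into the foreign zones' batches as well,
and the spilling would be bounded by exactly those batches, which is
the symptom described above.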

I'll continue to investigate.