Re: [LKP] Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression

From: Ying Huang
Date: Fri May 20 2022 - 02:44:40 EST

Next message: Dan Carpenter: "[PATCH v3] nvmem: brcm_nvram: check for allocation failure"
Previous message: Jiapeng Chong: "[PATCH v2] xfs: Remove dead code"
In reply to: ying.huang@xxxxxxxxx: "Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 2022-05-19 at 15:54 +0800, ying.huang@xxxxxxxxx wrote:
> Hi, Mel,
>
> On Wed, 2022-05-18 at 16:22 +0100, Mel Gorman wrote:
> > On Wed, May 18, 2022 at 05:24:14PM +0800, kernel test robot wrote:
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -11.2% regression of unixbench.score due to commit:
> > >
> > >
> > > commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
> > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
> > > patch link: https://lore.kernel.org/lkml/20220511143038.4620-4-mgorman@xxxxxxxxxxxxxxxxxxx
> > >
> > > in testcase: unixbench
> > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> > > with following parameters:
> > >
> > > runtime: 300s
> > > nr_task: 1
> > > test: shell8
> > > cpufreq_governor: performance
> > > ucode: 0xd000331
> > >
> > > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > > test-url: https://github.com/kdlucas/byte-unixbench
> >
> > I think what is happening for unixbench is that it prefers to run all
> > instances on a local node if possible. shell8 is creating 8 scripts,
> > each of which spawn more processes. The total number of tasks may exceed
> > the allowed imbalance at fork time of 16 tasks. Some spill over to a
> > remote node and as they are using files, some accesses are remote and it
> > suffers. It's not memory bandwidth bound but is sensitive to locality.
> > The stats somewhat support this idea
> >
> > >      83590 ± 13% -73.7% 21988 ± 32% numa-meminfo.node0.AnonHugePages
> > >     225657 ± 18% -58.0% 94847 ± 18% numa-meminfo.node0.AnonPages
> > >     231652 ± 17% -55.3% 103657 ± 16% numa-meminfo.node0.AnonPages.max
> > >     234525 ± 17% -55.5% 104341 ± 18% numa-meminfo.node0.Inactive
> > >     234397 ± 17% -55.5% 104267 ± 18% numa-meminfo.node0.Inactive(anon)
> > >      11724 ± 7% +17.5% 13781 ± 5% numa-meminfo.node0.KernelStack
> > >       4472 ± 34% +117.1% 9708 ± 31% numa-meminfo.node0.PageTables
> > >      15239 ± 75% +401.2% 76387 ± 10% numa-meminfo.node1.AnonHugePages
> > >      67256 ± 63% +206.3% 205994 ± 6% numa-meminfo.node1.AnonPages
> > >      73568 ± 58% +193.1% 215644 ± 6% numa-meminfo.node1.AnonPages.max
> > >      75737 ± 53% +183.9% 215053 ± 6% numa-meminfo.node1.Inactive
> > >      75709 ± 53% +183.9% 214971 ± 6% numa-meminfo.node1.Inactive(anon)
> > >       3559 ± 42% +187.1% 10216 ± 8% numa-meminfo.node1.PageTables
> >
> > There is less memory used on one node and more on the other so it's
> > getting split.
>
> This makes sense. I will also check CPU utilization per node to verify
> this directly.

I ran this workload 3 times each for the commit and its parent,
collecting mpstat per-node statistics.

For the parent commit,

"mpstat.node.0.usr%": [
0.1396875,
3.0806153846153848,
0.05303030303030303
],
"mpstat.node.0.sys%": [
0.10515625,
5.597692307692308,
0.1340909090909091
],

"mpstat.node.1.usr%": [
3.1015625,
0.1306153846153846,
3.0275757575757574
],
"mpstat.node.1.sys%": [
5.66703125,
0.11676923076923076,
5.498181818181818
],

The difference between the two nodes is quite large: in each run, one
node carries almost all of the load while the other is nearly idle.

For the commit,

"mpstat.node.0.usr%": [
1.42109375,
1.4725,
1.5140625
],
"mpstat.node.0.sys%": [
3.00125,
3.16390625,
3.1284375
],

"mpstat.node.1.usr%": [
1.4909375,
1.41609375,
1.3740625
],
"mpstat.node.1.sys%": [
3.1671875,
3.00109375,
3.044375
],

The difference between the two nodes is greatly reduced: the load is
now spread almost evenly across both. So this confirms your theory
directly.
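
For reference, here is a minimal Python sketch that quantifies the split
from the mpstat numbers above. For each run it computes the share of
total (usr + sys) CPU time that lands on the busier node. The helper
name busy_share and the dict layout are only for illustration; they are
not part of mpstat or the LKP tooling.

```python
# Per-run usr% and sys% for each node, copied from the mpstat figures above.
parent = {
    0: {"usr": [0.1396875, 3.0806153846153848, 0.05303030303030303],
        "sys": [0.10515625, 5.597692307692308, 0.1340909090909091]},
    1: {"usr": [3.1015625, 0.1306153846153846, 3.0275757575757574],
        "sys": [5.66703125, 0.11676923076923076, 5.498181818181818]},
}
commit = {
    0: {"usr": [1.42109375, 1.4725, 1.5140625],
        "sys": [3.00125, 3.16390625, 3.1284375]},
    1: {"usr": [1.4909375, 1.41609375, 1.3740625],
        "sys": [3.1671875, 3.00109375, 3.044375]},
}


def busy_share(stats):
    """Return, per run, the fraction of usr+sys time on the busier node.

    0.5 means a perfectly even split between two nodes, 1.0 means
    everything ran on a single node.
    """
    nodes = sorted(stats)
    runs = len(stats[nodes[0]]["usr"])
    shares = []
    for i in range(runs):
        per_node = [stats[n]["usr"][i] + stats[n]["sys"][i] for n in nodes]
        shares.append(max(per_node) / sum(per_node))
    return shares


parent_share = busy_share(parent)
commit_share = busy_share(commit)

print("parent:", [round(s, 3) for s in parent_share])
print("commit:", [round(s, 3) for s in commit_share])
```

With these numbers, the busier node accounts for roughly 97% of the CPU
time in every parent run, and roughly 51% in every run with the commit.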

Best Regards,
Huang, Ying

[snip]
