Re: [sched/core] 2ebb177175: will-it-scale.per_thread_ops -3.7% regression

From: Mel Gorman
Date: Tue Jun 16 2020 - 08:33:27 EST


On Tue, Jun 16, 2020 at 03:39:08PM +0800, kernel test robot wrote:
>
> FYI, we noticed a -3.7% regression of will-it-scale.per_thread_ops due to commit:
>

Thanks.

>
> commit: 2ebb17717550607bcd85fb8cf7d24ac870e9d762 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: will-it-scale
> on test machine: 8 threads Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz with 16G memory
> with following parameters:
>
> <SNIP>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | lmbench3: lmbench3.AF_UNIX.sock.stream.bandwidth.MB/sec 5.1% improvement |
> | test machine | 16 threads Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 48G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=development |
> | | nr_threads=50% |
> | | test=UNIX |
> | | test_memory_size=50% |
> | | ucode=0x7000019 |
> +------------------+--------------------------------------------------------------------------+
> | testcase: change | vm-scalability: vm-scalability.throughput 4.6% improvement |
> | test machine | 16 threads Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz with 32G memory |
> | test parameters | cpufreq_governor=performance |
> | | runtime=300s |
> | | size=2T |
> | | test=shm-pread-seq-mt |
> | | ucode=0xca |
> +------------------+--------------------------------------------------------------------------+
>

This is telling me that there are both wins and losses due to the
patch. That is not entirely unexpected given the nature of the patch and
how it impacts timings.

> c6e7bd7afaeb3af5 2ebb17717550607bcd85fb8cf7d
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :4 25% 1:4 dmesg.RIP:find_vma
> 1:4 -25% :4 dmesg.RIP:loop
> :4 25% 1:4 dmesg.RIP:poll_idle
> :4 25% 1:4 dmesg.RIP:release_pages
> :4 25% 1:4 kmsg.a562403>]usb_hcd_irq
> 1:4 -25% :4 kmsg.a9be0b>]usb_hcd_irq
> :4 25% 1:4 kmsg.c867d3b>]usb_hcd_irq
> :4 25% 1:4 kmsg.e2bf9>]usb_hcd_irq
> 1:4 -25% :4 kmsg.e6afaad>]usb_hcd_irq
> 1:4 -25% :4 kmsg.ef22>]usb_hcd_irq
> %stddev %change %stddev
> \ | \
> 1564377 -3.7% 1506585 will-it-scale.per_thread_ops
> 12515025 -3.7% 12052684 will-it-scale.workload

What is the variability here? The stddev is blank.

> 7295 ± 18% +32.3% 9651 ± 3% slabinfo.kmalloc-32.active_objs
> 7295 ± 18% +32.3% 9651 ± 3% slabinfo.kmalloc-32.num_objs
> 52712 ±167% -99.8% 122.25 ± 9% softirqs.CPU7.NET_RX
> 167833 ± 37% -96.0% 6770 ± 70% softirqs.NET_RX

This is curious, why was there a storm of network traffic during the
baseline run? It may not matter but I find it interesting that there was
different background traffic during the two tests.

> 8.17 -6.2 1.95 ± 2% mpstat.cpu.all.idle%
> 0.05 ± 39% -0.0 0.00 ± 92% mpstat.cpu.all.soft%
> 36.39 +4.6 40.97 mpstat.cpu.all.usr%
> 54.75 +2.3% 56.00 vmstat.cpu.sy
> 36.00 +11.8% 40.25 vmstat.cpu.us
> 879009 +9.8% 964967 vmstat.system.cs
> 80622 +438.9% 434480 vmstat.system.in
> 8567964 ± 11% -24.3% 6482942 ± 4% cpuidle.C1.time
> 7399663 ± 10% -46.7% 3944245 ± 5% cpuidle.C1.usage
> 4762729 ± 15% -76.7% 1108338 ± 4% cpuidle.C1E.time
> 2825633 ± 12% -92.5% 213227 ± 11% cpuidle.C1E.usage
> 1392828 ± 18% -89.8% 142380 ± 13% cpuidle.C3.usage
> 9138464 ± 19% -39.7% 5510711 ± 12% cpuidle.C6.time
> 49832321 -80.0% 9988732 ± 5% cpuidle.POLL.time
> 67398769 ± 2% -89.1% 7335934 ± 6% cpuidle.POLL.usage

This generally indicates that CPUs are spending less time in C-states
and more likely to be active. In itself, this is not a bad thing.

> 18365719 -91.7% 1533497 ± 4% interrupts.RES:Rescheduling_interrupts
> 103.25 ± 5% +159.1% 267.50 ± 5% interrupts.TLB:TLB_shootdowns
> 13.47 ± 3% -19.3% 10.86 ± 3% perf-stat.i.MPKI
> 1.746e+09 +1.9% 1.779e+09 perf-stat.i.branch-instructions
> 47675183 +1.8% 48541350 perf-stat.i.branch-misses
> 1.145e+08 ± 3% -18.3% 93535517 ± 3% perf-stat.i.cache-references
> 885445 +9.7% 971331 perf-stat.i.context-switches
> 3.43 -1.1% 3.39 perf-stat.i.cpi
> 2.155e+09 +3.4% 2.228e+09 perf-stat.i.dTLB-loads
> 1.73e+09 +5.6% 1.827e+09 perf-stat.i.dTLB-stores
> 5869850 ± 5% -16.5% 4900813 perf-stat.i.iTLB-load-misses
> 786437 ± 13% -30.9% 543649 ± 16% perf-stat.i.iTLB-loads
> 8.535e+09 +1.3% 8.642e+09 perf-stat.i.instructions
> 1492 ± 4% +20.6% 1798 perf-stat.i.instructions-per-iTLB-miss
> 0.29 +1.1% 0.30 perf-stat.i.ipc
> 720.16 +3.2% 742.92 perf-stat.i.metric.M/sec
> 13.41 ± 3% -19.3% 10.82 ± 3% perf-stat.overall.MPKI
> 3.42 -1.1% 3.39 perf-stat.overall.cpi
> 1457 ± 4% +21.0% 1763 perf-stat.overall.instructions-per-iTLB-miss
> 0.29 +1.1% 0.30 perf-stat.overall.ipc
> 205548 +5.0% 215927 perf-stat.overall.path-length
> 1.74e+09 +1.9% 1.773e+09 perf-stat.ps.branch-instructions
> 47519673 +1.8% 48383718 perf-stat.ps.branch-misses
> 1.141e+08 ± 3% -18.3% 93221019 ± 3% perf-stat.ps.cache-references
> 882476 +9.7% 968061 perf-stat.ps.context-switches
> 2.148e+09 +3.4% 2.221e+09 perf-stat.ps.dTLB-loads
> 1.724e+09 +5.6% 1.821e+09 perf-stat.ps.dTLB-stores
> 5850190 ± 5% -16.5% 4884334 perf-stat.ps.iTLB-load-misses
> 783812 ± 13% -30.9% 541830 ± 16% perf-stat.ps.iTLB-loads
> 8.507e+09 +1.3% 8.614e+09 perf-stat.ps.instructions
> 2.572e+12 +1.2% 2.602e+12 perf-stat.total.instructions

I'm not getting much out of this. I was hoping to see something on
dcache hit/miss rates but only cache-references are mentioned. I think
what is happening is that the threads are staying active to contend on a
shared mutex more and incurring more cache misses as a result.

> 8.95 ± 2% -6.5 2.50 ± 3% perf-profile.calltrace.cycles-pp.try_to_wake_up.wake_up_q.futex_wake.do_futex.__x64_sys_futex
> 9.15 ± 2% -6.4 2.79 ± 3% perf-profile.calltrace.cycles-pp.wake_up_q.futex_wake.do_futex.__x64_sys_futex.do_syscall_64

Less time spend in try_to_wake_up which is not surprising given the
patch is meant to avoid spinning in that path.

> 22.31 -4.6 17.69 perf-profile.calltrace.cycles-pp.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.__lll_unlock_wake
> 21.97 -4.6 17.37 perf-profile.calltrace.cycles-pp.futex_wake.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 40.70 -4.6 36.13 perf-profile.calltrace.cycles-pp.__lll_unlock_wake
> 23.13 -4.6 18.56 perf-profile.calltrace.cycles-pp.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.__lll_unlock_wake
> 30.54 -4.5 26.03 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__lll_unlock_wake
> 31.06 -4.5 26.59 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__lll_unlock_wake
> 2.81 ± 5% -2.4 0.40 ± 58% perf-profile.calltrace.cycles-pp.menu_select.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> 3.15 ± 10% -1.9 1.28 ± 10% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> 3.11 ± 10% -1.8 1.26 ± 10% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary

Less going into idle, generally good.

Overall, I do not think there is something worth digging into here.
Two other reports showed performance gains and the data for this profile
shows that the threads are likely spending more time being active so
they can contend on the mutex more aggressively. Having multiple threads
contend on the same mutex in a loop looks like an anti-pattern and I do
not think we want to artificially starve/stall threads just to make that
particular figure look good.

--
Mel Gorman
SUSE Labs