Re: [smb3] 5efdd9122e: filebench.sum_operations/s -50.0% regression

From: Steve French
Date: Thu Mar 23 2023 - 16:36:39 EST


Patch attached to return default back to 1 second from 5 seconds
(Bharath's recent closetimeo on lease break patch may also help)

Let me know if you see any problems with it

On Thu, Mar 23, 2023 at 2:48 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
> On 3/15/23 22:26, Steve French wrote:
> > Can you verify what this perf looks like with "closetime=0" and "closetime=1"
> >
> > Are there differences in /proc/fs/cifs/Stats when you run the same
> > steps with "closetimeo=1" vs. the more recent default (5 seconds)?
> The dump of /proc/fs/cifs/Stats are attached. You can tell which
> closetimeo is used from the file name. Thanks.
>
>
> Regards
> Yin, Fengwei
>
> >
> > On Wed, Mar 15, 2023 at 2:46 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 8/26/2022 10:41 AM, kernel test robot wrote:
> >>> Greeting,
> >>>
> >>> FYI, we noticed a -50.0% regression of filebench.sum_operations/s due to commit:
> >>>
> >>>
> >>> commit: 5efdd9122eff772eae2feae9f0fc0ec02d4846a3 ("smb3: allow deferred close timeout to be configurable")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>
> >>> in testcase: filebench
> >>> on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
> >>> with following parameters:
> >>>
> >>> disk: 1HDD
> >>> fs: ext4
> >>> fs2: cifs
> >>> test: filemicro_delete.f
> >>> cpufreq_governor: performance
> >>> ucode: 0x5003302
> >> Please note, we still could see this regresion on v6.3-rc2. And the regression is related with
> >> the commit: 5efdd9122eff772eae2feae9f0fc0ec02d4846a3. This commit changed the default timeout
> >> value from 1s to 5s. If change the timeout back to 1s as following:
> >>
> >> diff --git a/fs/cifs/fs_context.c b/fs/cifs/fs_context.c
> >> index 6d13f8207e96..6b930fb0c4bd 100644
> >> --- a/fs/cifs/fs_context.c
> >> +++ b/fs/cifs/fs_context.c
> >> @@ -1537,7 +1537,7 @@ int smb3_init_fs_context(struct fs_context *fc)
> >>
> >> ctx->acregmax = CIFS_DEF_ACTIMEO;
> >> ctx->acdirmax = CIFS_DEF_ACTIMEO;
> >> - ctx->closetimeo = SMB3_DEF_DCLOSETIMEO;
> >> + ctx->closetimeo = CIFS_DEF_ACTIMEO;
> >>
> >> /* Most clients set timeout to 0, allows server to use its default */
> >> ctx->handle_timeout = 0; /* See MS-SMB2 spec section 2.2.14.2.12 */
> >>
> >> The regression is gone:
> >> dcb45fd7f501f864 v6.3-rc2 32715af441411a5a266606be08f
> >> ---------------- --------------------------- ---------------------------
> >> fail:runs %reproduction fail:runs %reproduction fail:runs
> >> | | | | |
> >> :25 0% :3 33% 1:3 last_state.booting
> >> :25 0% :3 33% 1:3 last_state.is_incomplete_run
> >> %stddev %change %stddev %change %stddev
> >> \ | \ | \
> >> 515.95 -50.0% 257.98 -0.0% 515.92 filebench.sum_operations/s
> >> 5.04 ± 7% +833.7% 47.09 ± 2% -2.9% 4.90 ± 2% filebench.sum_time_ms/op
> >> 10438 -100.0% 0.33 ±141% -100.0% 0.50 ±100% filebench.time.major_page_faults
> >> 167575 -4.1% 160660 -4.4% 160140 filebench.time.maximum_resident_set_size
> >> 7138 ± 11% +141.4% 17235 ± 3% +147.6% 17677 filebench.time.minor_page_faults
> >> 28.68 ± 9% +199.9% 86.00 ± 7% -2.4% 28.00 filebench.time.percent_of_cpu_this_job_got
> >> 2453485 ± 54% -63.0% 907380 -66.2% 830273 ± 6% cpuidle..usage
> >> 0.61 ± 38% +0.8 1.41 ± 3% +0.2 0.80 ± 4% mpstat.cpu.all.sys%
> >> 142984 ± 13% -45.6% 77725 -47.5% 75106 vmstat.system.in
> >> 34.23 ± 7% +27.9% 43.79 +27.8% 43.74 boot-time.boot
> >> 17.09 ± 11% +66.0% 28.38 +65.5% 28.28 boot-time.dhcp
> >> 2661 ± 7% +37.5% 3659 +37.2% 3651 boot-time.idle
> >> 104737 ±185% -87.8% 12762 ± 10% -89.8% 10631 ± 4% turbostat.C1
> >>
> >>
> >> 32715af441411a5a266606be08f is v6.3-rc2 with the change to restore the timeout to 1s. Thanks.
> >>
> >> Regards
> >> Yin, Fengwei
> >>
> >>
> >>>
> >>>
> >>> =========================================================================================
> >>> compiler/cpufreq_governor/disk/fs2/fs/kconfig/rootfs/tbox_group/test/testcase/ucode:
> >>> gcc-11/performance/1HDD/cifs/ext4/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/filemicro_delete.f/filebench/0x5003302
> >>>
> >>> commit:
> >>> dcb45fd7f5 ("cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dir")
> >>> 5efdd9122e ("smb3: allow deferred close timeout to be configurable")
> >>>
> >>> dcb45fd7f501f864 5efdd9122eff772eae2feae9f0f
> >>> ---------------- ---------------------------
> >>> %stddev %change %stddev
> >>> \ | \
> >>> 515.95 -50.0% 257.98 filebench.sum_operations/s
> >>> 4.81 ± 2% +1038.4% 54.78 ± 6% filebench.sum_time_ms/op
> >>> 29.00 ± 8% +212.1% 90.50 ± 3% filebench.time.percent_of_cpu_this_job_got
> >>> 24629 +2.7% 25297 filebench.time.voluntary_context_switches
> >>> 7.685e+08 +19.3% 9.169e+08 ± 4% cpuidle..time
> >>> 2.53 ± 6% -20.6% 2.00 ± 3% iostat.cpu.iowait
> >>> 1506141 ± 8% +22.4% 1843256 ± 3% turbostat.IRQ
> >>> 2.00 -50.0% 1.00 vmstat.procs.b
> >>> 21969 ± 2% -9.5% 19885 ± 2% vmstat.system.cs
> >>> 3.06 ± 7% -0.7 2.35 ± 4% mpstat.cpu.all.iowait%
> >>> 0.79 ± 5% +0.5 1.27 ± 2% mpstat.cpu.all.sys%
> >>> 0.89 ± 3% -0.1 0.79 ± 3% mpstat.cpu.all.usr%
> >>> 34.55 ± 14% -34.8% 22.51 ± 27% sched_debug.cfs_rq:/.removed.runnable_avg.avg
> >>> 119.64 ± 3% -20.0% 95.69 ± 17% sched_debug.cfs_rq:/.removed.runnable_avg.stddev
> >>> 34.55 ± 14% -34.8% 22.51 ± 27% sched_debug.cfs_rq:/.removed.util_avg.avg
> >>> 119.64 ± 3% -20.0% 95.69 ± 17% sched_debug.cfs_rq:/.removed.util_avg.stddev
> >>> 5249 +15.8% 6076 meminfo.Active
> >>> 3866 ± 2% +17.7% 4552 meminfo.Active(anon)
> >>> 1382 ± 4% +10.3% 1524 ± 4% meminfo.Active(file)
> >>> 69791 ± 14% +39.8% 97553 ± 6% meminfo.AnonHugePages
> >>> 72709 ± 2% +12.5% 81779 ± 3% meminfo.Inactive(file)
> >>> 23219 +13.5% 26352 ± 3% meminfo.KernelStack
> >>> 966.50 ± 2% +17.7% 1137 proc-vmstat.nr_active_anon
> >>> 74302 +6.3% 78977 ± 2% proc-vmstat.nr_anon_pages
> >>> 81133 +6.0% 85973 proc-vmstat.nr_inactive_anon
> >>> 18172 ± 2% +12.5% 20442 ± 3% proc-vmstat.nr_inactive_file
> >>> 23213 +13.5% 26348 ± 3% proc-vmstat.nr_kernel_stack
> >>> 17983 +2.3% 18400 proc-vmstat.nr_mapped
> >>> 7446 ± 2% +5.5% 7853 ± 3% proc-vmstat.nr_shmem
> >>> 26888 +1.6% 27306 proc-vmstat.nr_slab_reclaimable
> >>> 47220 +3.4% 48803 proc-vmstat.nr_slab_unreclaimable
> >>> 966.50 ± 2% +17.7% 1137 proc-vmstat.nr_zone_active_anon
> >>> 81133 +6.0% 85973 proc-vmstat.nr_zone_inactive_anon
> >>> 18172 ± 2% +12.5% 20442 ± 3% proc-vmstat.nr_zone_inactive_file
> >>> 361460 +2.5% 370454 proc-vmstat.numa_hit
> >>> 946.67 +18.6% 1122 proc-vmstat.pgactivate
> >>> 361562 +2.5% 370553 proc-vmstat.pgalloc_normal
> >>> 187906 +4.7% 196761 proc-vmstat.pgfault
> >>> 8189 +2.5% 8395 proc-vmstat.pgreuse
> >>> 1.097e+09 +15.5% 1.267e+09 ± 7% perf-stat.i.branch-instructions
> >>> 39079265 ± 6% -20.9% 30915354 ± 4% perf-stat.i.branch-misses
> >>> 5093263 ± 4% -23.7% 3884752 ± 9% perf-stat.i.cache-misses
> >>> 29213 -18.7% 23764 ± 5% perf-stat.i.context-switches
> >>> 7.666e+09 ± 4% +5.7% 8.106e+09 ± 2% perf-stat.i.cpu-cycles
> >>> 1877 ± 15% +75.1% 3287 ± 12% perf-stat.i.cycles-between-cache-misses
> >>> 1735450 ± 3% -12.9% 1512060 ± 3% perf-stat.i.iTLB-load-misses
> >>> 2898 ± 3% +34.4% 3895 ± 7% perf-stat.i.instructions-per-iTLB-miss
> >>> 1493 -20.3% 1190 ± 7% perf-stat.i.major-faults
> >>> 0.09 ± 3% +5.8% 0.09 ± 2% perf-stat.i.metric.GHz
> >>> 48.47 ± 11% +8.4 56.83 ± 7% perf-stat.i.node-store-miss-rate%
> >>> 283426 ± 4% -21.6% 222190 ± 10% perf-stat.i.node-stores
> >>> 3.57 ± 7% -1.1 2.44 ± 6% perf-stat.overall.branch-miss-rate%
> >>> 1508 ± 3% +39.8% 2108 ± 9% perf-stat.overall.cycles-between-cache-misses
> >>> 3022 ± 3% +23.6% 3736 ± 5% perf-stat.overall.instructions-per-iTLB-miss
> >>> 9.585e+08 +18.8% 1.138e+09 ± 6% perf-stat.ps.branch-instructions
> >>> 34151514 ± 6% -18.8% 27725316 ± 4% perf-stat.ps.branch-misses
> >>> 4450329 ± 5% -21.7% 3486409 ± 9% perf-stat.ps.cache-misses
> >>> 25524 -16.4% 21333 ± 4% perf-stat.ps.context-switches
> >>> 77139 +2.5% 79105 perf-stat.ps.cpu-clock
> >>> 6.704e+09 ± 4% +8.7% 7.287e+09 perf-stat.ps.cpu-cycles
> >>> 1.06e+09 +11.3% 1.18e+09 ± 5% perf-stat.ps.dTLB-loads
> >>> 1517349 ± 3% -10.5% 1357716 ± 2% perf-stat.ps.iTLB-load-misses
> >>> 4.582e+09 +10.8% 5.075e+09 ± 6% perf-stat.ps.instructions
> >>> 1296 -18.1% 1061 ± 6% perf-stat.ps.major-faults
> >>> 247613 ± 4% -19.5% 199283 ± 9% perf-stat.ps.node-stores
> >>> 77139 +2.5% 79105 perf-stat.ps.task-clock
> >>> 3.697e+10 +35.3% 5.003e+10 perf-stat.total.instructions
> >>> 8.51 ± 91% -6.9 1.59 ±223% perf-profile.calltrace.cycles-pp.getdents64
> >>> 8.34 ± 83% -6.7 1.67 ±141% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>> 6.25 ±107% -6.2 0.00 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64
> >>> 6.25 ±107% -6.2 0.00 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
> >>> 6.25 ±107% -6.2 0.00 perf-profile.calltrace.cycles-pp.open64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.getdents64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.getdents64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.calltrace.cycles-pp.__x64_sys_getdents64.do_syscall_64.entry_SYSCALL_64_after_hwframe.getdents64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.calltrace.cycles-pp.iterate_dir.__x64_sys_getdents64.do_syscall_64.entry_SYSCALL_64_after_hwframe.getdents64
> >>> 6.26 ±115% -5.4 0.88 ±223% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>> 6.26 ±115% -5.4 0.88 ±223% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.calltrace.cycles-pp.do_group_exit.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop.exit_to_user_mode_prepare
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.calltrace.cycles-pp.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
> >>> 4.57 ± 73% -3.8 0.76 ±223% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.perf_mmap__read_head.perf_mmap__push.record__mmap_read_evlist.__cmd_record
> >>> 4.57 ± 73% -3.8 0.76 ±223% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.perf_mmap__read_head.perf_mmap__push.record__mmap_read_evlist
> >>> 4.57 ± 73% -3.8 0.76 ±223% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.perf_mmap__read_head.perf_mmap__push
> >>> 4.57 ± 73% -3.8 0.76 ±223% perf-profile.calltrace.cycles-pp.perf_mmap__read_head.perf_mmap__push.record__mmap_read_evlist.__cmd_record.cmd_record
> >>> 3.38 ±103% -3.4 0.00 perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.mmput.exit_mm.do_exit
> >>> 3.38 ±103% -3.4 0.00 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.mmput.exit_mm
> >>> 3.38 ±103% -3.4 0.00 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.mmput
> >>> 3.38 ±103% -3.4 0.00 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
> >>> 8.51 ± 91% -6.9 1.59 ±223% perf-profile.children.cycles-pp.getdents64
> >>> 8.34 ± 83% -6.7 1.67 ±141% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
> >>> 8.34 ± 83% -6.7 1.67 ±141% perf-profile.children.cycles-pp.exit_to_user_mode_loop
> >>> 8.34 ± 83% -6.7 1.67 ±141% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
> >>> 6.25 ±107% -6.2 0.00 perf-profile.children.cycles-pp.open64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.children.cycles-pp.__x64_sys_getdents64
> >>> 7.63 ± 84% -6.0 1.59 ±223% perf-profile.children.cycles-pp.iterate_dir
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.children.cycles-pp.arch_do_signal_or_restart
> >>> 6.26 ±115% -4.6 1.67 ±141% perf-profile.children.cycles-pp.get_signal
> >>> 4.57 ± 73% -3.8 0.76 ±223% perf-profile.children.cycles-pp.perf_mmap__read_head
> >>> 3.38 ±103% -3.4 0.00 perf-profile.children.cycles-pp.unmap_vmas
> >>> 3.38 ±103% -3.4 0.00 perf-profile.children.cycles-pp.unmap_page_range
> >>> 3.38 ±103% -3.4 0.00 perf-profile.children.cycles-pp.zap_pmd_range
> >>> 3.38 ±103% -3.4 0.00 perf-profile.children.cycles-pp.zap_pte_range
> >>> 8.54 ± 43% +8.6 17.19 ± 38% perf-profile.children.cycles-pp.asm_exc_page_fault
> >>>
> >>>
> >>> If you fix the issue, kindly add following tag
> >>> Reported-by: kernel test robot <yujie.liu@xxxxxxxxx>
> >>>
> >>>
> >>> To reproduce:
> >>>
> >>> git clone https://github.com/intel/lkp-tests.git
> >>> cd lkp-tests
> >>> sudo bin/lkp install job.yaml # job file is attached in this email
> >>> bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> >>> sudo bin/lkp run generated-yaml-file
> >>>
> >>> # if come across any failure that blocks the test,
> >>> # please remove ~/.lkp and /lkp dir to run from a clean state.
> >>>
> >>>
> >>> Disclaimer:
> >>> Results have been estimated based on internal Intel analysis and are provided
> >>> for informational purposes only. Any difference in system hardware or software
> >>> design or configuration may affect actual performance.
> >>>
> >>>
> >>> #regzbot introduced: 5efdd9122e
> >>>
> >>>
> >
> >
> >



--
Thanks,

Steve
From 0e3e4f66130a077f0bf20bd29b8ad21d4d5a9314 Mon Sep 17 00:00:00 2001
From: Steve French <stfrench@xxxxxxxxxxxxx>
Date: Thu, 23 Mar 2023 15:10:26 -0500
Subject: [PATCH] smb3: lower default deferred close timeout to address perf
regression

Performance tests with large number of threads noted that the change
of the default closetimeo (deferred close timeout between when
close is done by application and when client has to send the close
to the server), to 5 seconds from 1 second, significantly degraded
perf in some cases like this (in the filebench example reported,
the stats show close requests on the wire taking twice as long,
and 50% regression in filebench perf). This is stil configurable
via mount parm closetimeo, but to be safe, decrease default back
to its previous value of 1 second.

Reported-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
Reported-by: kernel test robot <yujie.liu@xxxxxxxxx>
Link: https://lore.kernel.org/lkml/997614df-10d4-af53-9571-edec36b0e2f3@xxxxxxxxx/
Fixes: 5efdd9122eff ("smb3: allow deferred close timeout to be configurable")
Cc: stable@xxxxxxxxxxxxxxx # 6.0+
Reviewed-by: Paulo Alcantara (SUSE) <pc@xxxxxxxxxxxxx>
Acked-by: Shyam Prasad N <sprasad@xxxxxxxxxxxxx>
Signed-off-by: Steve French <stfrench@xxxxxxxxxxxxx>
---
fs/cifs/fs_context.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/fs_context.h b/fs/cifs/fs_context.h
index 1b8d4e27f831..3de00e7127ec 100644
--- a/fs/cifs/fs_context.h
+++ b/fs/cifs/fs_context.h
@@ -286,5 +286,5 @@ extern void smb3_update_mnt_flags(struct cifs_sb_info *cifs_sb);
* max deferred close timeout (jiffies) - 2^30
*/
#define SMB3_MAX_DCLOSETIMEO (1 << 30)
-#define SMB3_DEF_DCLOSETIMEO (5 * HZ) /* Can increase later, other clients use larger */
+#define SMB3_DEF_DCLOSETIMEO (1 * HZ) /* even 1 sec enough to help eg open/write/close/open/read */
#endif
--
2.34.1