Re: [rfc][patch] store-free path walking

From: Nick Piggin
Date: Thu Oct 08 2009 - 08:38:01 EST


On Wed, Oct 07, 2009 at 07:56:33AM -0700, Linus Torvalds wrote:
> On Wed, 7 Oct 2009, Nick Piggin wrote:
> >
> > OK, I have a really basic patch that does store-free path walking
> > (except on the final element).
>
> Yay!
>
> > dbench is pretty nasty still because it seems to do a lot of stupid
> > things like reading from /proc/mounts all the time.
>
> You should largely forget about dbench, it can certainly be a useful
> benchmark, but at the same time it's certainly not a _meaningful_ one.
> There are better things to try.

OK, here's one you might find interesting: a cached git diff
workload in a Linux kernel tree. I ran it in a loop 100 times to
get reasonable sample sizes, in both parallel and serial configs
(PreloadIndex = true/false), and compared a plain kernel against
one with all the vfs patches so far.

2.6.32-rc3 serial
5.35user 7.12system 0:12.47elapsed 100%CPU

2.6.32-rc3 parallel
5.79user 17.69system 0:09.41elapsed 249%CPU

vfs serial
5.30user 5.62system 0:10.92elapsed 100%CPU

vfs parallel
4.86user 0.68system 0:06.82elapsed 81%CPU

(I don't know what happened with CPU accounting on the last one, but
elapsed time was accurate).
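
As a rough check on the timings above, the elapsed-time ratios can be
worked out with a few lines of Python. This is only a sketch: the
dictionary layout and the speedup() helper are illustrative names, and
the only inputs are the elapsed values copied from the time(1) lines.

```python
# Elapsed wall-clock seconds copied from the time(1) lines above.
elapsed = {
    "serial": {"plain": 12.47, "vfs": 10.92},
    "parallel": {"plain": 9.41, "vfs": 6.82},
}


def speedup(plain, vfs):
    """Ratio of baseline elapsed time to patched elapsed time."""
    return plain / vfs


# One ratio per configuration (serial, parallel).
ratios = {cfg: speedup(t["plain"], t["vfs"]) for cfg, t in elapsed.items()}

for cfg, ratio in ratios.items():
    print(f"{cfg}: {ratio:.2f}x faster with the vfs patches")
# prints:
# serial: 1.14x faster with the vfs patches
# parallel: 1.38x faster with the vfs patches
```

Only elapsed time is used here because the CPU accounting of the last
run is suspect.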

The profiles are interesting. The full output is pretty verbose, so
I've included just the backtraces for the locking functions.

serial
plain
# Samples: 288849
#
# Overhead Command Shared Object
# ........ .............. ................................
#
55.46% git [kernel]
|
|--36.52%-- __d_lookup
|--9.57%-- __link_path_walk
|--6.26%-- _atomic_dec_and_lock
| |
| |--39.42%-- dput
| | |
| | |--53.66%-- path_put
| | | |
| | | |--90.91%-- vfs_fstatat
| | | | vfs_lstat
| | | | sys_newlstat
| | | | system_call_fastpath
| | | |
| | | --9.09%-- path_walk
| | | do_path_lookup
| | | user_path_at
| | | vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --46.34%-- __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--31.73%-- path_put
| | |
| | |--57.58%-- vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --42.42%-- path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--21.15%-- __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --7.69%-- mntput_no_expire
| path_put
| |
| |--50.00%-- vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --50.00%-- path_walk
| do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--5.78%-- strncpy_from_user
|--5.60%-- _spin_unlock
| |
| |--88.17%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--4.30%-- path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--3.23%-- do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--2.15%-- handle_mm_fault
| | do_page_fault
| | page_fault
| |
| --2.15%-- __d_lookup
| do_lookup
| __link_path_walk
| path_walk
| do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--5.17%-- generic_fillattr
|--2.95%-- acl_permission_check
|--1.87%-- groups_search
|--1.81%-- kmem_cache_free
|--1.68%-- system_call
|--1.62%-- clear_page_c
|--1.56%-- do_lookup
|--1.44%-- _spin_lock
| |
| |--58.33%-- __d_lookup
| | do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--20.83%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--16.67%-- do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| --4.17%-- copy_process
| do_fork
| sys_clone
| stub_clone
| __libc_fork
| 0x494a5d
|
|--1.38%-- dput
|--1.38%-- mntput_no_expire
|--1.32%-- cp_new_stat
|--1.26%-- path_walk
|--1.20%-- sysret_check
|--1.08%-- kmem_cache_alloc
|--0.96%-- __follow_mount
|--0.96%-- copy_user_generic_string
|--0.66%-- in_group_p
|--0.54%-- page_fault
--7.40%-- [...]

So the serial case still spends significant time in locking: about 13%
of all kernel cycles.

vfs
# Samples: 254207
#
# Overhead Command Shared Object
# ........ .............. ................................
#
53.15% git [kernel]
|
|--37.47%-- __d_lookup_rcu
|--15.63%-- link_path_walk_rcu
|--6.70%-- strncpy_from_user
|--5.65%-- generic_fillattr
|--3.49%-- _spin_lock
| |
| |--66.00%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--14.00%-- mntput_no_expire
| | mntput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--6.00%-- link_path_walk_rcu
| | do_path_lookup
| | |
| | |--66.67%-- user_path_at
| | | vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --33.33%-- do_filp_open
| | do_sys_open
| | sys_open
| | system_call_fastpath
| |
| |--4.00%-- path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--4.00%-- do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--2.00%-- anon_vma_link
| | dup_mm
| | copy_process
| | do_fork
| | sys_clone
| | stub_clone
| | __libc_fork
| |
| |--2.00%-- do_page_fault
| | page_fault
| |
| --2.00%-- vfsmount_read_lock
| mntput_no_expire
| mntput
| path_put
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--2.44%-- kmem_cache_free
|--1.95%-- system_call
|--1.88%-- groups_search
|--1.81%-- do_path_lookup
|--1.54%-- cp_new_stat
|--1.33%-- clear_page_c
|--1.33%-- kmem_cache_alloc
|--1.12%-- mntput_no_expire
|--1.05%-- do_lookup_rcu
|--0.98%-- dput
|--0.91%-- page_fault
|--0.91%-- copy_user_generic_string
|--0.77%-- sysret_check
|--0.77%-- in_group_p
|--0.77%-- getname
|--0.70%-- _spin_unlock
| |
| |--30.00%-- mntput_no_expire
| | mntput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--20.00%-- link_path_walk_rcu
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--10.00%-- handle_mm_fault
| | do_page_fault
| | page_fault
| | 0x45f62a
| |
| |--10.00%-- vfsmount_read_unlock
| | mntput_no_expire
| | mntput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--10.00%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| |--10.00%-- path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| |
| --10.00%-- do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
| __lxstat
|
|--0.63%-- path_put
|--0.56%-- copy_page_c
|--0.56%-- user_path_at
--9.07%-- [...]

Locking goes to about 4%. A significant part of that comes from the
dput of the final dentry element, which is basically impossible to
avoid, so we're much closer to optimal.
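
The 13% and 4% figures can be reproduced by summing the top-level
entries for the locking functions in the two serial profiles. A minimal
sketch, assuming those top-level percentages are shares of kernel
samples (which is how the text above reads them); the variable names
are made up for illustration.

```python
# Top-level percentages copied from the two serial profiles above.
# Assumption: each value is a share of kernel samples, following the
# email's own reading of the numbers.
plain_serial = {
    "_atomic_dec_and_lock": 6.26,
    "_spin_unlock": 5.60,
    "_spin_lock": 1.44,
}
vfs_serial = {
    "_spin_lock": 3.49,
    "_spin_unlock": 0.70,
}

# Sum the lock-related entries for each kernel.
plain_total = round(sum(plain_serial.values()), 2)
vfs_total = round(sum(vfs_serial.values()), 2)

print(f"plain serial locking: {plain_total:.2f}%")
print(f"vfs serial locking:   {vfs_total:.2f}%")
# prints:
# plain serial locking: 13.30%
# vfs serial locking:   4.19%
```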

The parallel case is interesting too.
plain
# Samples: 635836
#
# Overhead Command Shared Object
# ........ .............. ................................
#
76.39% git [kernel]
|
|--32.26%-- _atomic_dec_and_lock
| |
| |--60.44%-- dput
| | |
| | |--51.15%-- path_put
| | | |
| | | |--94.91%-- path_walk
| | | | do_path_lookup
| | | | user_path_at
| | | | vfs_fstatat
| | | | vfs_lstat
| | | | sys_newlstat
| | | | system_call_fastpath
| | | |
| | | --5.09%-- vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --48.85%-- __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--14.04%-- mntput_no_expire
| | path_put
| | |
| | |--51.29%-- path_walk
| | | do_path_lookup
| | | user_path_at
| | | vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --48.71%-- vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--13.01%-- path_put
| | |
| | |--95.81%-- path_walk
| | | do_path_lookup
| | | user_path_at
| | | vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --4.19%-- vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --12.52%-- __link_path_walk
| path_walk
| do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--13.23%-- path_walk
|--12.94%-- __d_lookup
|--7.81%-- do_path_lookup
|--7.53%-- path_init
|--3.84%-- __link_path_walk
|--2.36%-- acl_permission_check
|--2.15%-- _spin_lock
| |
| |--42.73%-- _atomic_dec_and_lock
| | dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--39.09%-- __d_lookup
| | do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--9.09%-- do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--8.18%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --0.91%-- system_call_fastpath
| 0x7fb0fcf23257
| 0x7fb0fcf158bd
|
|--2.01%-- generic_fillattr
|--1.76%-- _spin_unlock
| |
| |--85.56%-- dput
| | path_put
| | |
| | |--98.70%-- vfs_fstatat
| | | vfs_lstat
| | | sys_newlstat
| | | system_call_fastpath
| | |
| | --1.30%-- __link_path_walk
| | path_walk
| | do_path_lookup
| | do_filp_open
| | do_sys_open
| | sys_open
| | system_call_fastpath
| |
| |--5.56%-- __d_lookup
| | do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--4.44%-- path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--2.22%-- do_lookup
| | __link_path_walk
| | path_walk
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--1.11%-- handle_mm_fault
| | do_page_fault
| | page_fault
| |
| --1.11%-- update_process_times
| tick_sched_timer
| __run_hrtimer
| hrtimer_interrupt
| smp_apic_timer_interrupt
| apic_timer_interrupt
|
|--1.62%-- _read_unlock
| |
| |--75.90%-- path_init
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --24.10%-- do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--1.29%-- strncpy_from_user
|--1.17%-- path_put
|--1.01%-- dput
|--0.62%-- kmem_cache_free
|--0.60%-- do_lookup
|--0.59%-- clear_page_c

We can see it is really starting to choke on atomic_dec_and_lock. I
don't know how many tasks you spawn off in git here, but it looks
like this is nearing the absolute limit of scalability.

vfs

# Samples: 273522
#
# Overhead Command Shared Object
# ........ .............. ................................
#
48.24% git [kernel]
|
|--32.37%-- __d_lookup_rcu
|--14.14%-- link_path_walk_rcu
|--7.57%-- _read_unlock
| |
| |--96.46%-- path_init_rcu
| | do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| --3.54%-- do_path_lookup
| user_path_at
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
|
|--7.04%-- generic_fillattr
|--5.50%-- strncpy_from_user
|--2.68%-- kmem_cache_free
|--2.55%-- _spin_lock
| |
| |--81.58%-- dput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--5.26%-- do_path_lookup
| | user_path_at
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| |
| |--5.26%-- try_to_wake_up
| | |
| | |--50.00%-- wake_up_state
| | | wake_futex
| | | futex_wake
| | | do_futex
| | | sys_futex
| | | mm_release
| | | exit_mm
| | | do_exit
| | | sys_exit
| | | system_call_fastpath
| | | start_thread
| | |
| | --50.00%-- wake_up_process
| | __up_write
| | up_write
| | sys_mmap
| | system_call_fastpath
| | mmap64
| |
| |--5.26%-- vfsmount_read_lock
| | mntput_no_expire
| | mntput
| | path_put
| | vfs_fstatat
| | vfs_lstat
| | sys_newlstat
| | system_call_fastpath
| | __lxstat
| | |
| | |--50.00%-- 0x7f7640b9e2c0
| | | 0x4ab3b1fc
| | |
| | --50.00%-- 0x7f7640bb4e78
| | 0x4a803476
| |
| --2.63%-- path_put
| vfs_fstatat
| vfs_lstat
| sys_newlstat
| system_call_fastpath
| __lxstat
| 0x7f7640d7f488
| 0x4a8034a4
|
|--2.48%-- clear_page_c
|--1.61%-- system_call
|--1.47%-- copy_user_generic_string
|--1.41%-- cp_new_stat
|--1.41%-- groups_search
|--1.21%-- do_lookup_rcu
|--0.94%-- kmem_cache_alloc
|--0.94%-- do_path_lookup
|--0.87%-- in_group_p
|--0.80%-- page_fault
|--0.80%-- sysret_check
|--0.74%-- dput
|--0.67%-- getname
|--0.67%-- user_path_at
|--0.67%-- mntput_no_expire
|--0.60%-- unmap_vmas
|--0.54%-- _spin_unlock
|--0.54%-- vfs_fstatat
|--0.54%-- path_init_rcu
--9.25%-- [...]

This one is interesting. spin_lock/spin_unlock remains very low, but
read_unlock pops up. This would be... fs->lock, which is shared between
tasks that share an fs_struct. You're using threads then (rather than
processes)?
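
The same kind of sum for the two parallel profiles shows how much of
the remaining locking cost is _read_unlock. Again a sketch only: it
counts just the lock functions that appear by name at the top level of
each listing, treats the percentages as shares of kernel samples, and
uses illustrative variable names.

```python
# Top-level percentages copied from the two parallel profiles above.
# Only lock functions listed by name are counted; anything hidden in
# the "[...]" remainder is ignored.
plain_parallel = {
    "_atomic_dec_and_lock": 32.26,
    "_spin_lock": 2.15,
    "_spin_unlock": 1.76,
    "_read_unlock": 1.62,
}
vfs_parallel = {
    "_read_unlock": 7.57,
    "_spin_lock": 2.55,
    "_spin_unlock": 0.54,
}

plain_total = round(sum(plain_parallel.values()), 2)
vfs_total = round(sum(vfs_parallel.values()), 2)

# Fraction of the vfs locking total that is _read_unlock.
read_unlock_share = vfs_parallel["_read_unlock"] / vfs_total

print(f"plain parallel locking: {plain_total:.2f}%")
print(f"vfs parallel locking:   {vfs_total:.2f}%")
print(f"_read_unlock share of vfs locking: {read_unlock_share:.0%}")
# prints:
# plain parallel locking: 37.79%
# vfs parallel locking:   10.66%
# _read_unlock share of vfs locking: 71%
```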