Inode Lock Scalability V6

From: Dave Chinner
Date: Wed Oct 20 2010 - 20:50:41 EST


This patch set is derived from Nick Piggin's VFS scalability tree.
This is an attempt to push the process of finer grained review of
the series for upstream inclusion. I'm hitting VFS lock contention
problems with XFS on 8-16p machines now, so I need to get this stuff
moving.

This patch set is just the basic inode_lock breakup patches plus a
few more simple changes to the inode code. It stops short of
introducing RCU inode freeing because those changes are not
completely baked yet.

As a result, the full inode handling improvements of Nick's patch
set are not realised with this short series. However, my own testing
indicates that the amount of lock traffic and contention is down by
an order of magnitude on an 8-way box for parallel inode create and
unlink workloads, so there are still significant improvements from
just this patch set.

Version 2 of this series was a complete rework of the original patch
series. I've pulled in several of the cleanups and re-ordered the
series such that cleanups, factoring and list splitting are done
before any of the locking changes. Instead of converting the inode
state flags first, I've converted them last, ensuring that
manipulations are kept inside other locks rather than outside them.

The series is made up of the following steps:

- inode counters are made per-cpu
- inode LRU manipulations are made lazy
- i_list is split into two lists (grows inode by 2
pointers), one for tracking lru status, one for writeback
status
- reference counting is factored, then renamed and locked
differently
- protect iunique counter with its own lock
- hash lookups and reference counting are cleaned up
- inode hash operations are factored, then locked per bucket
- superblock inode list is locked per-superblock
- inode LRU is locked via a global lock
  - unclear what the best way to split this up from
    here is, so no attempt is made to optimise
    further.
  - Currently not showing signs of contention under
    any workload on an 8p machine.
- inode IO list is locked via a per-BDI lock
  - further analysis needed to determine the next step
    in optimising this list. It is extremely contended
    under parallel workloads because foreground
    throttling (balance_dirty_pages) causes unbounded
    writeback parallelism and contention. Fixing the
    unbounded parallelism, I think, is a more important
    first optimisation step than making the list
    per-cpu.
- lock i_state operations with i_lock
- removed unnecessary i_state lock avoidance optimisations
- convert last_ino allocation to a percpu counter
- remove inode_lock
- push inode number assignment out of the inode allocation code and
into the filesystems that require it
- factor destroying an inode into dispose_one_inode() which
is called from reclaim, dispose_list and iput_final.
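
For readers unfamiliar with the per-cpu counter idea behind the first
step above, here is a minimal userspace sketch. It is not code from this
series: pcpu_counter, pcpu_slot, NR_SLOTS and the explicit cpu argument
are hypothetical names used only for illustration. The point is that
updates touch only the caller's slot, so the hot path shares no lock or
cache line, while reads pay the cost of summing every slot.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 8	/* hypothetical: stands in for the number of CPUs */

/* One slot per CPU, padded so neighbouring slots do not share a cache line. */
struct pcpu_slot {
	long count;
	char pad[64 - sizeof(long)];
};

struct pcpu_counter {
	struct pcpu_slot slot[NR_SLOTS];
};

/* Fast path: modify only the caller's own slot, no shared lock taken. */
static void pcpu_counter_add(struct pcpu_counter *c, unsigned int cpu,
			     long delta)
{
	assert(cpu < NR_SLOTS);
	c->slot[cpu].count += delta;
}

/*
 * Slow path: sum every slot. A single slot may go negative (an object
 * counted on one CPU can be uncounted on another); only the sum matters.
 */
static long pcpu_counter_sum(const struct pcpu_counter *c)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i].count;
	return sum;
}
```

The trade-off in this sketch is that a read is O(number of slots) and
only approximate while updates are in flight, which is acceptable for
statistics that are read rarely and updated constantly.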

Version 6:
- removed reference to sb_inode_list_lock in documentation
- removed references to writeback_single_inode in comments.
- cleaned up some typos reported by Christian Stroetmann
<stroetmann@xxxxxxxxxxxxx>.
- dropped unnecessary EXPORT_SYMBOL for bdi_lock_two().
- cleaned up stale remove_inode_hash comment.
- added a new patch to fix an inode hash lookup/removal race by
protecting wake_up_inode() with the i_lock. This also removes the
now unnecessary memory barrier based inode_lock contention
optimisation for clearing I_NEW in unlock_new_inode.

Version 5:
- removed buggy can_unuse() optimisation in prune_icache that the
lazy LRU code exposes.
- Christoph found a nasty bug in the new hash locking code where the
hash lock is dropped between the lookup and insert in
get_new_inode[_fast](). This lookup and insert needs to be atomic,
so it needs fixing. Thanks to Christoph for finding and fixing
the bug.

Detailed changes:
- iunique rework moved forward in the series to before the
inode hash locking changes
- new patch to move inode reference on successful lookup
back inside find_inode[_fast]()
- moved splitting of inode_add_to_lists forward to before
the inode hash locking changes
- modified the introduction of the new inode hash list locks
to be taken outside find_inode[_fast]() and held until the
new inode is inserted into the hash. They cover the same
scope as the inode_lock covered. This is the bug fix.
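
To illustrate why the lookup and the insert have to happen under a
single hold of the bucket lock, here is a minimal userspace sketch. It
is not the kernel code: hobj, hbucket, hobj_get() and the pthread mutex
(standing in for the per-bucket hash lock) are hypothetical names and
simplifications. If the lock were dropped between bucket_find() and the
insert, two racing callers could both miss the lookup and both insert an
object for the same number.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NR_BUCKETS 16

struct hobj {
	unsigned long ino;
	struct hobj *next;
};

struct hbucket {
	pthread_mutex_t lock;	/* stands in for the per-bucket hash lock */
	struct hobj *head;
};

static struct hbucket table[NR_BUCKETS];

static void table_init(void)
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

/* Caller must hold b->lock. */
static struct hobj *bucket_find(struct hbucket *b, unsigned long ino)
{
	for (struct hobj *o = b->head; o; o = o->next)
		if (o->ino == ino)
			return o;
	return NULL;
}

/*
 * Return the object for @ino, creating it if needed. The candidate is
 * allocated before the lock is taken; the lookup and the insert then
 * run under one hold of the bucket lock, so they are atomic with
 * respect to other callers hashing to the same bucket.
 */
static struct hobj *hobj_get(unsigned long ino)
{
	struct hbucket *b = &table[ino % NR_BUCKETS];
	struct hobj *new = malloc(sizeof(*new));
	struct hobj *old;

	if (!new)
		return NULL;
	new->ino = ino;

	pthread_mutex_lock(&b->lock);
	old = bucket_find(b, ino);
	if (!old) {
		new->next = b->head;
		b->head = new;
	}
	pthread_mutex_unlock(&b->lock);

	if (old) {
		free(new);	/* lost the race or already present */
		return old;
	}
	return new;
}
```

Allocating outside the lock is a choice made for this sketch so that the
critical section contains only the list walk and the pointer updates.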

Version 4:
- re-added inode reference count check in writeback_single_inode()
when the inode is clean and only attempt to add the inode to the
LRU if the inode is unreferenced.
- moved hash_bl_[un]lock into hlist_bl.h introductory patch.
- updated documentation and comments still referencing i_count
- updated documentation and comments still referencing inode_lock
- removed a couple of unneeded include files.
- writeback_single_inode() and sync_inode are now the same, so fold
writeback_single_inode() into sync_inode.
- moved lock ordering comments around into the patches that
introduce the locks or change the ordering.
- cleaned up dispose_one_inode comments and layout.
- added patch to start of series to move bdev inodes around bdi's
as they change the bdi in the inode mapping during the final put
of the bdev. Changes to this new code propagate through the subsequent
scalability patches.

Version 3:
- whitespace fix in inode_init_early.
- dropped patch that moves inodes around bdi lists as problem is now
fixed in mainline.
- added comments explaining lazy inode LRU manipulations.
- added inode_lru_list_{add,del} helpers much earlier to avoid
needing to export then unexport inode counters.
- renamed i_io to i_wb_list.
- removed iref_locked and just open code internal inode reference
increments.
- added a WARN_ON() condition to detect iref() being called without
a pre-existing reference count.
- added kerneldoc comment to iref().
- dropped iref_read() wrapper function patch
- killed the inode_hash_bucket wrapper, use hlist_bl_head directly
- moved spin_[un]lock_bucket wrappers to list_bl.h, and renamed them
hlist_bl_[un]lock()
- added inode_unhashed() helper function.
- documented use of I_FREEING to ensure removal from inode lru and
writeback lists is kept sane when the inode is being freed.
- added inode_wb_list_del() helper to avoid exporting the
inode_to_bdi() function.
- added comments to explain why we need to set the i_state field
before adding new inodes to various lists
- renamed last_ino_get() to get_next_ino().
- kept invalidate_list/dispose_list pairing for invalidate_inodes(),
but changed the dispose list to use the i_sb_list pointer in the
inode instead of the i_lru to avoid needing to take the
inode_lru_lock for every inode on the superblock list.
- added patch from Christoph Hellwig to split up inode_add_to_lists.
Modified the new function names to match the naming convention
used by all the other list helpers in inode.c, and added a
matching inode_sb_list_del() function for symmetry.
- added patch from Christoph Hellwig to move inode number assignment
in get_new_inode() to the callers that don't directly assign an
inode number.
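
As an illustration of the iref() rule mentioned in the list above
(taking another reference is only valid when the caller already holds
one), here is a minimal userspace sketch. It is an assumption-laden
stand-in, not the kernel function: robj, robj_ref(), the pthread mutex
(in place of the inode's i_lock) and the fprintf() warning (in place of
WARN_ON()) are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct robj {
	pthread_mutex_t lock;	/* stands in for inode->i_lock */
	int ref;		/* stands in for the inode reference count */
};

/*
 * Take an additional reference. Like a WARN_ON(), a violation is
 * reported but the increment still happens. Returns true if the caller
 * held a reference beforehand, false otherwise.
 */
static bool robj_ref(struct robj *o)
{
	bool had_ref;

	pthread_mutex_lock(&o->lock);
	had_ref = o->ref > 0;
	o->ref++;
	pthread_mutex_unlock(&o->lock);

	if (!had_ref)
		fprintf(stderr, "warning: robj_ref() without existing reference\n");
	return had_ref;
}
```

The check is taken under the same lock as the increment so that the
"was there already a reference" test cannot race with a concurrent
release in this sketch.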

Version 2:
- complete rework of series
--
The following changes since commit cb655d0f3d57c23db51b981648e452988c0223f9:

Linux 2.6.36-rc7 (2010-10-06 13:39:52 -0700)

are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

Christoph Hellwig (4):
fs: Stop abusing find_inode_fast in iunique
fs: move i_ref increments into find_inode/find_inode_fast
fs: remove inode_add_to_list/__inode_add_to_list
fs: do not assign default i_ino in new_inode

Dave Chinner (13):
fs: switch bdev inode bdi's correctly
fs: Convert nr_inodes and nr_unused to per-cpu counters
fs: Clean up inode reference counting
exofs: use iput() for inode reference count decrements
fs: rework icount to be a locked variable
fs: Factor inode hash operations into functions
fs: Introduce per-bucket inode hash locks
fs: add a per-superblock lock for the inode list
fs: split locking of inode writeback and LRU lists
fs: Protect inode->i_state with the inode->i_lock
fs: protect wake_up_inode with inode->i_lock
fs: icache remove inode_lock
fs: Reduce inode I_FREEING and factor inode disposal

Eric Dumazet (1):
fs: introduce a per-cpu last_ino allocator

Nick Piggin (3):
kernel: add bl_list
fs: Implement lazy LRU updates for inodes
fs: inode split IO and LRU lists

Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/porting | 8 +-
Documentation/filesystems/vfs.txt | 16 +-
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
drivers/infiniband/hw/ipath/ipath_fs.c | 1 +
drivers/infiniband/hw/qib/qib_fs.c | 1 +
drivers/misc/ibmasm/ibmasmfs.c | 1 +
drivers/oprofile/oprofilefs.c | 1 +
drivers/usb/core/inode.c | 1 +
drivers/usb/gadget/f_fs.c | 1 +
drivers/usb/gadget/inode.c | 1 +
fs/9p/vfs_inode.c | 5 +-
fs/affs/inode.c | 2 +-
fs/afs/dir.c | 2 +-
fs/anon_inodes.c | 8 +-
fs/autofs4/inode.c | 1 +
fs/bfs/dir.c | 2 +-
fs/binfmt_misc.c | 1 +
fs/block_dev.c | 42 +-
fs/btrfs/inode.c | 18 +-
fs/buffer.c | 2 +-
fs/ceph/mds_client.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/coda/dir.c | 2 +-
fs/configfs/inode.c | 1 +
fs/debugfs/inode.c | 1 +
fs/drop_caches.c | 19 +-
fs/exofs/inode.c | 6 +-
fs/exofs/namei.c | 2 +-
fs/ext2/namei.c | 2 +-
fs/ext3/ialloc.c | 4 +-
fs/ext3/namei.c | 2 +-
fs/ext4/ialloc.c | 4 +-
fs/ext4/mballoc.c | 1 +
fs/ext4/namei.c | 2 +-
fs/freevxfs/vxfs_inode.c | 1 +
fs/fs-writeback.c | 235 +++++----
fs/fuse/control.c | 1 +
fs/gfs2/ops_inode.c | 2 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/dir.c | 2 +-
fs/hfsplus/hfsplus_fs.h | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/hugetlbfs/inode.c | 1 +
fs/inode.c | 850 +++++++++++++++++++-----------
fs/internal.h | 11 +
fs/jffs2/dir.c | 4 +-
fs/jfs/jfs_txnmgr.c | 2 +-
fs/jfs/namei.c | 2 +-
fs/libfs.c | 2 +-
fs/locks.c | 2 +-
fs/logfs/dir.c | 2 +-
fs/logfs/inode.c | 2 +-
fs/logfs/readwrite.c | 2 +-
fs/minix/namei.c | 2 +-
fs/namei.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/nfs/getroot.c | 2 +-
fs/nfs/inode.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfs/write.c | 2 +-
fs/nilfs2/gcdat.c | 1 +
fs/nilfs2/gcinode.c | 22 +-
fs/nilfs2/mdt.c | 5 +-
fs/nilfs2/namei.c | 2 +-
fs/nilfs2/segment.c | 2 +-
fs/nilfs2/the_nilfs.h | 2 +-
fs/notify/inode_mark.c | 46 +-
fs/notify/mark.c | 1 -
fs/notify/vfsmount_mark.c | 1 -
fs/ntfs/inode.c | 10 +-
fs/ntfs/super.c | 6 +-
fs/ocfs2/dlmfs/dlmfs.c | 2 +
fs/ocfs2/inode.c | 2 +-
fs/ocfs2/namei.c | 2 +-
fs/pipe.c | 2 +
fs/proc/base.c | 2 +
fs/proc/proc_sysctl.c | 2 +
fs/quota/dquot.c | 32 +-
fs/ramfs/inode.c | 1 +
fs/reiserfs/namei.c | 2 +-
fs/reiserfs/stree.c | 2 +-
fs/reiserfs/xattr.c | 2 +-
fs/smbfs/inode.c | 2 +-
fs/super.c | 1 +
fs/sysv/namei.c | 2 +-
fs/ubifs/dir.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/udf/inode.c | 2 +-
fs/udf/namei.c | 2 +-
fs/ufs/namei.c | 2 +-
fs/xfs/linux-2.6/xfs_buf.c | 1 +
fs/xfs/linux-2.6/xfs_iops.c | 6 +-
fs/xfs/linux-2.6/xfs_trace.h | 2 +-
fs/xfs/xfs_inode.h | 3 +-
include/linux/backing-dev.h | 3 +
include/linux/fs.h | 43 +-
include/linux/list_bl.h | 146 +++++
include/linux/poison.h | 2 +
include/linux/writeback.h | 4 -
ipc/mqueue.c | 3 +-
kernel/cgroup.c | 1 +
kernel/futex.c | 2 +-
kernel/sysctl.c | 4 +-
mm/backing-dev.c | 28 +-
mm/filemap.c | 6 +-
mm/rmap.c | 6 +-
mm/shmem.c | 7 +-
net/socket.c | 3 +-
net/sunrpc/rpc_pipe.c | 1 +
security/inode.c | 1 +
security/selinux/selinuxfs.c | 1 +
114 files changed, 1128 insertions(+), 624 deletions(-)
create mode 100644 include/linux/list_bl.h