Re: [PATCH] deadlock with suspend and quotas

From: Jan Kara
Date: Tue Nov 29 2011 - 05:21:59 EST


Sorry for replying once more - forgot to add those CCs - so please reply
to this email...

On Tue 29-11-11 11:19:01, Jan Kara wrote:
> Hi,
>
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > >
> > > Where can I get that patch set?
> > >
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> > > background writeback (these code paths take s_umount and wait trying to do
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> > > Are there other known deadlock possibilities?
> >
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
> > deadlock" (I couldn't find the next two parts of the patch in the
> > archives). And the patch looks wrong:
> Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
>
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
> > held when the filesystem is frozen and it is taken for write when thawing.
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
> > on a frozen filesystem and if this tasks attempts to do an I/O that is
> > waiting for thaw, it may still deadlock.
> Agreed.
>
> > - skipping sync on frozen filesystem violates sync semantics.
> > Applications, such as databases, assume that when sync finishes, data were
> > written to stable storage. If we skip sync when the filesystem is frozen,
> > we can cause data corruption in these applications (if the system crashes
> > after we skipped a sync).
> Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
>
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
>
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.
>
> > - I'm not sure what userspace quota subsystem will do if we start
> > returning -EBUSY spuriously.
> Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
>
> > There is another thing --- I wasn't able to reproduce these sync-related
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel
> > prevents creating new dirty data, syncs all data, and freezes the
> > filesystem. Consequently, the sync function never finds any dirty data and
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know
> > why).
> See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.
>
> > I made this patch that fixes the quota issue and possible sync issues. It
> > introduces a function down_read_s_umount(sb); --- this function takes
> > s_umount lock for read, but it waits if the filesystem is suspended.
> >
> > There are three more potentially unsafe users: fs/cachefiles/interface.c,
> > fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't
> > test them.
> >
> > Mikulas
> >
> > ---
> >
> > Fix a s_umount deadlock
> >
> > The lock s_umount is taken for write and dropped when we freeze and resume
> > a block device.
> >
> > Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> > happen if the device is suspended.
> >
> > Unfortunatelly, it seems that developers are not aware of this restriction
> > that I/O must not be peformed with s_umount held.
> >
> > These deadlocks were observed:
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> > ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> > call_rwsem_down_write_failed (the process is trying to resume frozen device,
> > but it is waiting trying to acquire s_umount)
> > * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
> > ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
> > start_this_handle (the process is holding s_umount and trying to perform
> > a transaction, waiting for the device being unfrozen)
> >
> > * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
> > writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
> > (the process is holding s_umount for read and trying to perform some I/O)
> > * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
> > wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
> > writeback_single_inode -> do_writepages -> ext4_da_writepages ->
> > ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
> > (holding s_umount for read too, acquired in writeback_inodes_wb,
> > and trying to perform I/O)
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> > ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> > call_rwsem_down_write_failed (just like in a previous case: trying to
> > resume frozen device, waiting on s_umount)
> >
> > This patch fixes these observed code paths:
> > * introduce a function to safely take s_unlock - down_read_s_umount. It takes
> > the lock, check if a filesystem is frozen, if it is, drops the lock and
> > waits for unfreeze.
> > * get_super function has a parameter, if the parameter is true, it waits for
> > the device to unfreeze (it fixes quota deadlock and a possible deadlock in
> > fsync_bdev and __invalidate_device)
> > * the same for iterate_supers (it fixes the sync deadlock and a possible
> > deadlock in drop_caches_sysctl_handler)
> > * grab_super_passive fails if the device is suspended (fixes the inode writeback
> > deadlock and a possible deadlock in prune_super)
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> > CC: stable@xxxxxxxxxx
> >
> > ---
> > fs/block_dev.c | 6 +++---
> > fs/buffer.c | 2 +-
> > fs/drop_caches.c | 2 +-
> > fs/fs-writeback.c | 8 ++++++++
> > fs/quota/quota.c | 4 ++--
> > fs/super.c | 29 ++++++++++++++++++++---------
> > fs/sync.c | 4 ++--
> > include/linux/fs.h | 28 +++++++++++++++++++++++++---
> > security/selinux/hooks.c | 2 +-
> > 9 files changed, 63 insertions(+), 22 deletions(-)
> >
> > Index: linux-3.2-rc3-fast/fs/quota/quota.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/quota/quota.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/quota/quota.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
> > return -EINVAL;
> > ret = security_quotactl(Q_SYNC, type, 0, NULL);
> > if (!ret)
> > - iterate_supers(quota_sync_one, &type);
> > + iterate_supers(quota_sync_one, &type, true);
> > return ret;
> > }
> >
> > @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
> > putname(tmp);
> > if (IS_ERR(bdev))
> > return ERR_CAST(bdev);
> > - sb = get_super(bdev);
> > + sb = get_super(bdev, true);
> > bdput(bdev);
> > if (!sb)
> > return ERR_PTR(-ENODEV);
> > Index: linux-3.2-rc3-fast/include/linux/fs.h
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/include/linux/fs.h 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/include/linux/fs.h 2011-11-25 05:56:03.000000000 +0100
> > @@ -10,6 +10,7 @@
> > #include <linux/ioctl.h>
> > #include <linux/blk_types.h>
> > #include <linux/types.h>
> > +#include <linux/sched.h>
> >
> > /*
> > * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> > @@ -1502,6 +1503,26 @@ enum {
> > wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
> >
> > /*
> > + * Take s_umount and make sure that the filesystem is not frozen.
> > + * This function must be used if we intend to perform any I/O while
> > + * holding s_umount.
> > + *
> > + * s_umount is taken for write when resuming a frozen filesystem,
> > + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> > + * it may deadlock.
> > + */
> > +static inline void down_read_s_umount(struct super_block *sb)
> > +{
> > +retry:
> > + down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_write(&sb->s_umount);
> > + vfs_check_frozen(sb, SB_FREEZE_WRITE);
> > + goto retry;
> > + }
> > +}
> > +
> > +/*
> > * until VFS tracks user namespaces for inodes, just make all files
> > * belong to init_user_ns
> > */
> > @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
> > extern void get_filesystem(struct file_system_type *fs);
> > extern void put_filesystem(struct file_system_type *fs);
> > extern struct file_system_type *get_fs_type(const char *name);
> > -extern struct super_block *get_super(struct block_device *);
> > +extern struct super_block *get_super(struct block_device *, bool);
> > extern struct super_block *get_active_super(struct block_device *bdev);
> > extern struct super_block *user_get_super(dev_t);
> > extern void drop_super(struct super_block *sb);
> > -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> > +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
> > extern void iterate_supers_type(struct file_system_type *,
> > - void (*)(struct super_block *, void *), void *);
> > + void (*)(struct super_block *, void *), void *,
> > + bool);
> >
> > extern int dcache_dir_open(struct inode *, struct file *);
> > extern int dcache_dir_close(struct inode *, struct file *);
> > Index: linux-3.2-rc3-fast/fs/buffer.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/buffer.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/buffer.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
> >
> > static void do_thaw_all(struct work_struct *work)
> > {
> > - iterate_supers(do_thaw_one, NULL);
> > + iterate_supers(do_thaw_one, NULL, false);
> > kfree(work);
> > printk(KERN_WARNING "Emergency Thaw complete\n");
> > }
> > Index: linux-3.2-rc3-fast/fs/super.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/super.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/super.c 2011-11-29 00:16:01.000000000 +0100
> > @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
> > spin_unlock(&sb_lock);
> >
> > if (down_read_trylock(&sb->s_umount)) {
> > - if (sb->s_root)
> > + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> > return true;
> > up_read(&sb->s_umount);
> > }
> > @@ -503,7 +503,7 @@ void sync_supers(void)
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > if (sb->s_root && sb->s_dirt)
> > sb->s_op->write_super(sb);
> > up_read(&sb->s_umount);
> > @@ -527,7 +527,8 @@ void sync_supers(void)
> > * Scans the superblock list and calls given function, passing it
> > * locked superblock and given argument.
> > */
> > -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> > +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> > + bool wait_for_unfreeze)
> > {
> > struct super_block *sb, *p = NULL;
> >
> > @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + if (!wait_for_unfreeze)
> > + down_read(&sb->s_umount);
> > + else
> > + down_read_s_umount(sb);
> > if (sb->s_root)
> > f(sb, arg);
> > up_read(&sb->s_umount);
> > @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
> > * locked superblock and given argument.
> > */
> > void iterate_supers_type(struct file_system_type *type,
> > - void (*f)(struct super_block *, void *), void *arg)
> > + void (*f)(struct super_block *, void *), void *arg,
> > + bool wait_for_unfreeze)
> > {
> > struct super_block *sb, *p = NULL;
> >
> > @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + if (!wait_for_unfreeze)
> > + down_read(&sb->s_umount);
> > + else
> > + down_read_s_umount(sb);
> > if (sb->s_root)
> > f(sb, arg);
> > up_read(&sb->s_umount);
> > @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
> > * mounted on the device given. %NULL is returned if no match is found.
> > */
> >
> > -struct super_block *get_super(struct block_device *bdev)
> > +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
> > {
> > struct super_block *sb;
> >
> > @@ -612,7 +620,10 @@ rescan:
> > if (sb->s_bdev == bdev) {
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> > - down_read(&sb->s_umount);
> > + if (wait_for_unfreeze)
> > + down_read_s_umount(sb);
> > + else
> > + down_read(&sb->s_umount);
> > /* still alive? */
> > if (sb->s_root)
> > return sb;
> > @@ -672,7 +683,7 @@ rescan:
> > if (sb->s_dev == dev) {
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > /* still alive? */
> > if (sb->s_root)
> > return sb;
> > Index: linux-3.2-rc3-fast/fs/sync.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/sync.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/sync.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
> > */
> > static void sync_filesystems(int wait)
> > {
> > - iterate_supers(sync_one_sb, &wait);
> > + iterate_supers(sync_one_sb, &wait, true);
> > }
> >
> > /*
> > @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> > return -EBADF;
> > sb = file->f_dentry->d_sb;
> >
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > ret = sync_filesystem(sb);
> > up_read(&sb->s_umount);
> >
> > Index: linux-3.2-rc3-fast/fs/block_dev.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/block_dev.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/block_dev.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
> > */
> > int fsync_bdev(struct block_device *bdev)
> > {
> > - struct super_block *sb = get_super(bdev);
> > + struct super_block *sb = get_super(bdev, true);
> > if (sb) {
> > int res = sync_filesystem(sb);
> > drop_super(sb);
> > @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
> > * to freeze_bdev grab an active reference and only the last
> > * thaw_bdev drops it.
> > */
> > - sb = get_super(bdev);
> > + sb = get_super(bdev, false);
> > drop_super(sb);
> > mutex_unlock(&bdev->bd_fsfreeze_mutex);
> > return sb;
> > @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
> >
> > int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> > {
> > - struct super_block *sb = get_super(bdev);
> > + struct super_block *sb = get_super(bdev, true);
> > int res = 0;
> >
> > if (sb) {
> > Index: linux-3.2-rc3-fast/fs/drop_caches.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/drop_caches.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/drop_caches.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
> > return ret;
> > if (write) {
> > if (sysctl_drop_caches & 1)
> > - iterate_supers(drop_pagecache_sb, NULL);
> > + iterate_supers(drop_pagecache_sb, NULL, true);
> > if (sysctl_drop_caches & 2)
> > drop_slab();
> > }
> > Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/security/selinux/hooks.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
> >
> > /* Set up any superblocks initialized prior to the policy load. */
> > printk(KERN_DEBUG "SELinux: Setting up existing superblocks.\n");
> > - iterate_supers(delayed_superblock_init, NULL);
> > + iterate_supers(delayed_superblock_init, NULL, true);
> > }
> >
> > /* SELinux requires early initialization in order to label
> > Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c 2011-11-29 00:09:30.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/fs-writeback.c 2011-11-29 00:12:49.000000000 +0100
> > @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_read(&sb->s_umount);
> > + return 0;
> > + }
> > writeback_inodes_sb(sb, reason);
> > up_read(&sb->s_umount);
> > return 1;
> > @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_read(&sb->s_umount);
> > + return 0;
> > + }
> > writeback_inodes_sb_nr(sb, nr, reason);
> > up_read(&sb->s_umount);
> > return 1;
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/