Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore

From: Oleg Nesterov
Date: Fri Aug 07 2015 - 15:58:09 EST


Jan, Dave, please help.

I'll try to re-check/re-test, but so far I think that this and the
previous series are correct. However 3/4 from the previous series
(attached at the end just in case) uncovers (I think) some problems
in xfs locking.

What should I do now?

On 07/22, Oleg Nesterov wrote:
>
> Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> ioctl(FITHAW) seems to wors "as expected" on my testing machine
> with ext3. So probably this needs more testing.

Finally I got the testing machine and ran xfstests, I did

dd if=/dev/zero of=TEST.img bs=1MiB count=4000
dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000

losetup --find --show TEST.img
losetup --find --show SCRATCH.img

mkfs.xfs -f /dev/loop0
mkfs.xfs -f /dev/loop1

mkdir -p TEST SCRATCH

mount /dev/loop0 TEST
mount /dev/loop1 SCRATCH

TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
./check `grep -il freeze tests/*/???`

several times and every time the result looks like below:

FSTYP -- xfs (non-debug)
PLATFORM -- Linux/x86_64 intel-canoepass-10 4.1.0+
MKFS_OPTIONS -- -f -bsize=4096 /dev/loop1
MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

generic/068 59s ... 61s
generic/085 23s ... 26s
generic/280 2s ... 3s
generic/311 169s ... 167s
xfs/011 21s ... 20s
xfs/119 32s ... 21s
xfs/297 455s ... 301s
Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
Passed all 7 tests

But with CONFIG_LOCKDEP=y generic/068 triggers the warning:

[ 66.092831] ======================================================
[ 66.099726] [ INFO: possible circular locking dependency detected ]
[ 66.106719] 4.1.0+ #2 Not tainted
[ 66.110414] -------------------------------------------------------
[ 66.117405] fsstress/2210 is trying to acquire lock:
[ 66.122944] (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
[ 66.131637]
but task is already holding lock:
[ 66.138146] (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
[ 66.148022]
which lock already depends on the new lock.

[ 66.157141]
the existing dependency chain (in reverse order) is:
[ 66.165490]
-> #4 (&xfs_dir_ilock_class){++++..}:
[ 66.170974] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.177596] [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
[ 66.184605] [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
[ 66.191645] [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
[ 66.199638] [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
[ 66.206950] [<ffffffff81279411>] notify_change+0x271/0x3a0
[ 66.213762] [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
[ 66.221251] [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
[ 66.227576] [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
[ 66.234878]
-> #3 (sb_internal#2){++++++}:
[ 66.239698] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.246316] [<ffffffff81824256>] down_write+0x36/0x70
[ 66.252644] [<ffffffff81100979>] percpu_down_write+0x39/0x110
[ 66.259760] [<ffffffff8125901d>] freeze_super+0xbd/0x190
[ 66.266369] [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
[ 66.273082] [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
[ 66.279311] [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
[ 66.286610]
-> #2 (sb_pagefaults#2){++++++}:
[ 66.291623] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.298237] [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
[ 66.305144] [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
[ 66.312140] [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
[ 66.319245] [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
[ 66.327533] [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
[ 66.334433] [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
[ 66.341533] [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
[ 66.348542] [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
[ 66.355172] [<ffffffff818286e8>] page_fault+0x28/0x30
[ 66.361493]
-> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
[ 66.367659] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.374275] [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
[ 66.381185] [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
[ 66.388212] [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
[ 66.395817] [<ffffffff81200a9e>] __do_fault+0x4e/0x100
[ 66.402239] [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
[ 66.409340] [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
[ 66.416344] [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
[ 66.422959] [<ffffffff818286e8>] page_fault+0x28/0x30
[ 66.429283]
-> #0 (&mm->mmap_sem){++++++}:
[ 66.434093] [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
[ 66.441194] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.447808] [<ffffffff8120058f>] might_fault+0x6f/0xa0
[ 66.454235] [<ffffffff8126e05a>] filldir+0x9a/0x130
[ 66.460373] [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
[ 66.469233] [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
[ 66.476449] [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
[ 66.483954] [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
[ 66.490473] [<ffffffff8126e361>] SyS_getdents+0x91/0x120
[ 66.497091] [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
[ 66.504381]
other info that might help us debug this:

[ 66.513316] Chain exists of:
&mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class

[ 66.522619] Possible unsafe locking scenario:

[ 66.529222] CPU0 CPU1
[ 66.534275] ---- ----
[ 66.539327] lock(&xfs_dir_ilock_class);
[ 66.543823] lock(sb_internal#2);
[ 66.550465] lock(&xfs_dir_ilock_class);
[ 66.557767] lock(&mm->mmap_sem);
[ 66.561580]
*** DEADLOCK ***

[ 66.568186] 2 locks held by fsstress/2210:
[ 66.572753] #0: (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
[ 66.583103] #1: (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
[ 66.593457]
stack backtrace:
[ 66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
[ 66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[ 66.616372] 0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
[ 66.624663] 0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
[ 66.632955] 0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
[ 66.641249] Call Trace:
[ 66.643983] [<ffffffff8181d9dd>] dump_stack+0x45/0x57
[ 66.649713] [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
[ 66.656413] [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
[ 66.662919] [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
[ 66.669523] [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
[ 66.675543] [<ffffffff81200562>] ? might_fault+0x42/0xa0
[ 66.681566] [<ffffffff8120058f>] might_fault+0x6f/0xa0
[ 66.687394] [<ffffffff81200562>] ? might_fault+0x42/0xa0
[ 66.693418] [<ffffffff8126e05a>] filldir+0x9a/0x130
[ 66.698968] [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
[ 66.706941] [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
[ 66.715203] [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
[ 66.721712] [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
[ 66.728316] [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
[ 66.735998] [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
[ 66.742894] [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
[ 66.748819] [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
[ 66.754938] [<ffffffff8126e361>] SyS_getdents+0x91/0x120
[ 66.760958] [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
[ 66.766884] [<ffffffff8182662e>] system_call_fastpath+0x12/0x76

The chain reported by lockdep is not exactly the same every time,
but similar.

Once again, I'll recheck. but the patch below still looks correct
to me, and I think that it is actually a fix.

Before this patch freeze_super() calls sync_filesystem() and
s_op->freeze_fs(sb) without ->s_writers locks (as it seen by
lockdep) and this is wrong.

After this patch lockdep knows about ->s_writers locks held and this
is what we want, but apparently this way lockdep can notice the new
dependencies.

Oleg.

------------------------------------------------------------------------------
Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()

Move the "fool the lockdep" code from sb_wait_write() into the new
simple helper, sb_lockdep_release(), called once before return from
freeze_super().

This is preparation, but imo this makes sense in any case. This way
we do not hide from lockdep the "write" locks we hold when we call
s_op->freeze_fs(sb).

Signed-off-by: Oleg Nesterov <oleg@xxxxxxxxxx>
---
fs/super.c | 16 ++++++++++------
1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..e7ea9f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
{
s64 writers;

- /*
- * We just cycle-through lockdep here so that it does not complain
- * about returning with lock to userspace
- */
rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
- rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);

do {
DEFINE_WAIT(wait);
-
/*
* We use a barrier in prepare_to_wait() to separate setting
* of frozen and checking of the counter
@@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
} while (writers);
}

+static void sb_freeze_release(struct super_block *sb)
+{
+ int level;
+ /* Avoid the warning from lockdep_sys_exit() */
+ for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+ rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+}
+
/**
* freeze_super - lock the filesystem and force it into a consistent state
* @sb: the super to lock
@@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
sb->s_writers.frozen = SB_UNFROZEN;
smp_wmb();
wake_up(&sb->s_writers.wait_unfrozen);
+ sb_freeze_release(sb);
deactivate_locked_super(sb);
return ret;
}
@@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
* sees write activity when frozen is set to SB_FREEZE_COMPLETE.
*/
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+ sb_freeze_release(sb);
up_write(&sb->s_umount);
return 0;
}
--
1.5.5.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/