Re: [syzbot] [btrfs?] possible deadlock in __btrfs_release_delayed_node (3)

From: Boqun Feng
Date: Sat Sep 02 2023 - 17:55:33 EST


On Wed, Aug 30, 2023 at 07:07:16PM +0800, Hillf Danton wrote:
> On Mon, 28 Aug 2023 16:03:59 -0700
> > syzbot found the following issue on:
> >
> > HEAD commit: 93f5de5f648d Merge tag 'acpi-6.5-rc8' of git://git.kernel...
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17f71340680000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=1b32f62c755c3a9c
> > dashboard link: https://syzkaller.appspot.com/bug?extid=a379155f07c134ea9879
> > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12684fa7a80000
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/9f3fa40677fd/disk-93f5de5f.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/1aa00d4d7c58/vmlinux-93f5de5f.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/361b7a3f46b3/bzImage-93f5de5f.xz
> > mounted in repro: https://storage.googleapis.com/syzbot-assets/696d28540778/mount_0.gz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+a379155f07c134ea9879@xxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > BTRFS info (device loop2): enabling ssd optimizations
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted
> > ------------------------------------------------------
> > syz-executor.2/13257 is trying to acquire lock:
^^^^^^^^^^^^^^^^^

The warning actually happened at acquire time.

> > ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
> >
> > but task is already holding lock:
> > ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #1 (btrfs-tree-00){++++}-{3:3}:
> > __lock_release kernel/locking/lockdep.c:5475 [inline]
> > lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781
> > up_write+0x79/0x580 kernel/locking/rwsem.c:1625
>
> Why was warned at release time instead of acquire time?
>

If you look carefully, this is not the stack when the issue happened,
this is the stack when a previous lock dependency was built, now the
question is why lockdep records the stack trace at release time? To
answer this, please consider the following case:

spin_lock(&A);
spin_lock(&B); // <- step #1
spin_lock(&C); // <- step #2

spin_unlock(&B); // <-step #3
spin_unlock(&C); // <-step #4
spin_unlock(&A);

(note that lock order and unlock order of B and C are not symmetric)

At step #1, the lockdep lock graph will record the dependency:

A -> B along with the stack trace at step #1

At step #2, the lockdep lock graph will record the dependency:

B -> C along with the stack trace at step #2

so at step #2, there are *two* dependencies in the graph:

A -> B
B -> C

, and dependency A -> C can be inferred by transverse in the graph.
Without saving the dependency explicitly in the graph, we save some
memory on the dependency and the stack trace.

Now at step #3, the interesting part happens: lock_release() is
implemented as 1) removing B and rest of held locks, i.e. C from the
current lock held stack, and 2) *reacquiring* the rest of held locks,
i.e. C. The reacquiring will trigger lockdep to record the dependency:

A -> C along with the stack track at step #3 (release time!)

And if afterwards, someone does a:

spin_lock(&C);
spin_lock(&A);

lockdep will warn at spin_lock(&A); because A -> C is already in the
graph, and the report will show the stack trace associated with
dependency A -> C.

Of course if we reverse step #3 and #4, there will be no A -> C in the
graph, and when the same deadlock possibility happens, lockdep will
report with dependency chain A -> B -> C and stack traces at step #1 and
#2.

If you don't want this behavior (i.e. you want to use more memory by
default to more accurate stack traces), you can use the following patch:

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 111607d91489..b1a18b544d74 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3261,15 +3261,6 @@ check_prevs_add(struct task_struct *curr, struct held_lock *next)
int ret = check_prev_add(curr, hlock, next, distance, &trace);
if (!ret)
return 0;
-
- /*
- * Stop after the first non-trylock entry,
- * as non-trylock entries have added their
- * own direct dependencies already, so this
- * lock is connected to them indirectly:
- */
- if (!hlock->trylock)
- break;
}

depth--;

There definitely are smarter ways than this, your ideas are welcome!

Hope this helps.

Regards,
Boqun

> Hillf
> > btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
> > btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239
> > search_leaf fs/btrfs/ctree.c:1986 [inline]
> > btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230
> > btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376
> > btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline]
> > btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
> > __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111
> > __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153
> > flush_space+0x269/0xe70 fs/btrfs/space-info.c:723
> > btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078
> > process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600
> > worker_thread+0xa63/0x1210 kernel/workqueue.c:2751
> > kthread+0x2b8/0x350 kernel/kthread.c:389
> > ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145
> > ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
> >
> > -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
> > check_prev_add kernel/locking/lockdep.c:3142 [inline]
> > check_prevs_add kernel/locking/lockdep.c:3261 [inline]
> > validate_chain kernel/locking/lockdep.c:3876 [inline]
> > __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
> > lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
> > __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603
> > __mutex_lock kernel/locking/mutex.c:747 [inline]
> > mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799
> > __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
> > btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
> > __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156
> > btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276
> > btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988
> > vfs_fsync_range fs/sync.c:188 [inline]
> > vfs_fsync fs/sync.c:202 [inline]
> > do_fsync fs/sync.c:212 [inline]
> > __do_sys_fsync fs/sync.c:220 [inline]
> > __se_sys_fsync fs/sync.c:218 [inline]
> > __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > other info that might help us debug this:
> >
> > Possible unsafe locking scenario:
> >
> > CPU0 CPU1
> > ---- ----
> > lock(btrfs-tree-00);
> > lock(&delayed_node->mutex);
> > lock(btrfs-tree-00);
> > lock(&delayed_node->mutex);
> >
> > *** DEADLOCK ***
> >
> > 3 locks held by syz-executor.2/13257:
> > #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline]
> > #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287
> > #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288
> > #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198