Re: [Ocfs2-devel] [PATCH] ocfs2: reflink deadlock when clone file to the same directory simultaneously

From: Wengang Wang
Date: Fri Jul 30 2021 - 12:36:01 EST


Hi Gang,

> On Jul 29, 2021, at 11:16 PM, Gang He <ghe@xxxxxxxx> wrote:
>
> Hello Wengang and all,
>
> This issue can be reproduced reliably when you run the reflink command line below (you can also follow it with an "rm this file" command and sleep for some microseconds) from each node repeatedly for a while.
> Based on my observation, the reflink processes are always blocked at the below points.
> From dlm_tool output and crash analysis, node1 has acquired the .snapshots directory inode EX dlm lock, but the reflink process is blocked at ocfs2_init_security_and_acl+0xbe/0x1d0 trying to acquire its inode dlm lock again.

So the reflink path on node 1 has the .snapshots inode lock granted and is blocking on the newly created inode under the orphan directory. BTW, what’s the .snapshots directory? What’s the call path that locks the .snapshots inode?

> On the other two nodes, the reflink processes are blocked acquiring the .snapshots directory inode dlm lock; the whole file system then hangs,
> and you cannot list this file again.

So there are reflink paths on the other two nodes blocking at the .snapshots inode. But which locks have they already been granted?

For a typical ABBA deadlock,
path 1 granted lock A and blocks at lock B
path 2 granted lock B and blocks at lock A

Per your description, I see this:
reflink path on node1 is granted the .snapshots lock and blocks at the new inode lock
reflink paths on node2/3 block at the .snapshots lock.

I don't see how the deadlock forms… Is the new inode lock granted to any of the reflink paths on node2/3? If so, how?
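
To make that check concrete, here is a minimal sketch of the reasoning above. It is not ocfs2 code: struct path, holder_of() and has_deadlock() are names made up for illustration, and it assumes every lock is exclusive with at most one holder. It follows "waits for the holder of" edges and reports a deadlock only if a chain comes back to where it started.

```c
#include <stddef.h>
#include <string.h>

#define MAX_HELD 4

/* Hypothetical model of one call path: locks it holds, lock it waits for. */
struct path {
	const char *name;
	const char *held[MAX_HELD];	/* granted locks, NULL-terminated */
	const char *wanted;		/* lock being waited for, or NULL */
};

/* Return the index of the path holding `lock`, or -1 if nobody holds it. */
int holder_of(const struct path *p, int n, const char *lock)
{
	for (int i = 0; i < n; i++)
		for (int j = 0; j < MAX_HELD && p[i].held[j]; j++)
			if (strcmp(p[i].held[j], lock) == 0)
				return i;
	return -1;
}

/* Return 1 if some chain of waiters loops back to its starting path. */
int has_deadlock(const struct path *p, int n)
{
	for (int start = 0; start < n; start++) {
		int cur = start;

		for (int steps = 0; steps < n; steps++) {
			if (!p[cur].wanted)
				break;		/* not blocked */
			cur = holder_of(p, n, p[cur].wanted);
			if (cur < 0)
				break;		/* wanted lock has no holder */
			if (cur == start)
				return 1;	/* cycle: deadlock */
		}
	}
	return 0;
}

/* The typical ABBA case. */
const struct path abba[] = {
	{ "path 1", { "A" }, "B" },
	{ "path 2", { "B" }, "A" },
};

/* The state as I read it from the description (an assumption). */
const struct path described[] = {
	{ "node1", { ".snapshots" }, "new inode" },
	{ "node2", { NULL },         ".snapshots" },
	{ "node3", { NULL },         ".snapshots" },
};
```

With these two tables, has_deadlock(abba, 2) finds the cycle, while has_deadlock(described, 3) does not: nobody in the described state holds the new inode lock, so the chain from node2/3 ends at node1. That is why I am asking which path holds the lock node1 is waiting for.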

thanks,
wengang

>
> The problem looks like it comes from acquiring the destination directory lock multiple times during ocfs2_reflink; the dlm glue layer cannot downconvert the lock in some cases.
> e.g.
> kernel: (ocfs2dc-F50B203,1593,0):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M000000000000000004661c00000000
> kernel: (ocfs2dc-F50B203,1593,0):ocfs2_unblock_lock:3918 ERROR: status = -16
> kernel: (ocfs2dc-F50B203,1593,0):ocfs2_process_blocked_lock:4317 ERROR: status = -16
>
> Then I changed the code to acquire the destination directory dlm lock once and hold it until the end of the ocfs2_reflink function.
> After this change, I did not encounter the hang again after lots of testing. Second, I find the code change also improves reflink performance, since it avoids the previous ping-pong effect.
>
> Thanks
> Gang
>
>
> On 2021/7/30 6:07, Wengang Wang wrote:
>> Hi Gang,
>> I’d suggest you list the call paths on the related nodes. Say call path 1 on node one is granted lock A and is requesting lock B; at the same time, path 2 on node two is granted lock B and is requesting lock A.
>> With that, the problem would be easier to understand.
>> thanks,
>> wengang
>>> On Jul 29, 2021, at 4:02 AM, Gang He <ghe@xxxxxxxx> wrote:
>>>
>>> Running reflink from multiple nodes simultaneously to clone a file
>>> to the same directory probably triggers a deadlock issue.
>>> For example, there is a three node ocfs2 cluster, each node mounts
>>> the ocfs2 file system to /mnt/shared, and run the reflink command
>>> from each node repeatedly, like
>>> reflink "/mnt/shared/test" \
>>> "/mnt/shared/.snapshots/test.`date +%m%d%H%M%S`.`hostname`"
>>> then, reflink command process will be hung on each node, and you
>>> can't list this file system directory.
>>> The problematic reflink command process is blocked at one node,
>>> task:reflink state:D stack: 0 pid: 1283 ppid: 4154
>>> Call Trace:
>>> __schedule+0x2fd/0x750
>>> schedule+0x2f/0xa0
>>> schedule_timeout+0x1cc/0x310
>>> ? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
>>> ? 0xffffffffc0e3e000
>>> wait_for_completion+0xba/0x140
>>> ? wake_up_q+0xa0/0xa0
>>> __ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
>>> ? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
>>> ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
>>> ocfs2_init_security_and_acl+0xbe/0x1d0 [ocfs2]
>>> ocfs2_reflink+0x436/0x4c0 [ocfs2]
>>> ? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
>>> ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
>>> ocfs2_ioctl+0x25e/0x670 [ocfs2]
>>> do_vfs_ioctl+0xa0/0x680
>>> ksys_ioctl+0x70/0x80
>>> __x64_sys_ioctl+0x16/0x20
>>> do_syscall_64+0x5b/0x1e0
>>> The other reflink command processes are blocked at other nodes,
>>> task:reflink state:D stack: 0 pid:29759 ppid: 4088
>>> Call Trace:
>>> __schedule+0x2fd/0x750
>>> schedule+0x2f/0xa0
>>> schedule_timeout+0x1cc/0x310
>>> ? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
>>> ? 0xffffffffc0b19000
>>> wait_for_completion+0xba/0x140
>>> ? wake_up_q+0xa0/0xa0
>>> __ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
>>> ? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
>>> ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
>>> ocfs2_mv_orphaned_inode_to_new+0x87/0x7e0 [ocfs2]
>>> ocfs2_reflink+0x335/0x4c0 [ocfs2]
>>> ? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
>>> ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
>>> ocfs2_ioctl+0x25e/0x670 [ocfs2]
>>> do_vfs_ioctl+0xa0/0x680
>>> ksys_ioctl+0x70/0x80
>>> __x64_sys_ioctl+0x16/0x20
>>> do_syscall_64+0x5b/0x1e0
>>> or
>>> task:reflink state:D stack: 0 pid:18465 ppid: 4156
>>> Call Trace:
>>> __schedule+0x302/0x940
>>> ? usleep_range+0x80/0x80
>>> schedule+0x46/0xb0
>>> schedule_timeout+0xff/0x140
>>> ? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
>>> ? 0xffffffffc0c3b000
>>> __wait_for_common+0xb9/0x170
>>> __ocfs2_cluster_lock.constprop.0+0x1d6/0x860 [ocfs2]
>>> ? ocfs2_wait_for_recovery+0x49/0xd0 [ocfs2]
>>> ? ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
>>> ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
>>> ocfs2_inode_lock_tracker+0xf2/0x2b0 [ocfs2]
>>> ? dput+0x32/0x2f0
>>> ocfs2_permission+0x45/0xe0 [ocfs2]
>>> inode_permission+0xcc/0x170
>>> link_path_walk.part.0.constprop.0+0x2a2/0x380
>>> ? path_init+0x2c1/0x3f0
>>> path_parentat+0x3c/0x90
>>> filename_parentat+0xc1/0x1d0
>>> ? filename_lookup+0x138/0x1c0
>>> filename_create+0x43/0x160
>>> ocfs2_reflink_ioctl+0xe6/0x380 [ocfs2]
>>> ocfs2_ioctl+0x1ea/0x2c0 [ocfs2]
>>> ? do_sys_openat2+0x81/0x150
>>> __x64_sys_ioctl+0x82/0xb0
>>> do_syscall_64+0x61/0xb0
>>>
>>> The deadlock is caused by multiple acquiring the destination directory
>>> inode dlm lock in ocfs2_reflink function, we should acquire this
>>> directory inode dlm lock at the beginning, and hold this dlm lock until
>>> end of the function.
>>>
>>> Signed-off-by: Gang He <ghe@xxxxxxxx>
>>> ---
>>> fs/ocfs2/namei.c | 32 +++++++++++++-------------------
>>> fs/ocfs2/namei.h | 2 ++
>>> fs/ocfs2/refcounttree.c | 15 +++++++++++----
>>> fs/ocfs2/xattr.c | 12 +-----------
>>> fs/ocfs2/xattr.h | 1 +
>>> 5 files changed, 28 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>>> index 2c46ff6ba4ea..f8bbb22cc60b 100644
>>> --- a/fs/ocfs2/namei.c
>>> +++ b/fs/ocfs2/namei.c
>>> @@ -2489,6 +2489,7 @@ static int ocfs2_prep_new_orphaned_file(struct inode *dir,
>>> }
>>>
>>> int ocfs2_create_inode_in_orphan(struct inode *dir,
>>> + struct buffer_head **dir_bh,
>>> int mode,
>>> struct inode **new_inode)
>>> {
>>> @@ -2597,13 +2598,16 @@ int ocfs2_create_inode_in_orphan(struct inode *dir,
>>>
>>> brelse(new_di_bh);
>>>
>>> - if (!status)
>>> - *new_inode = inode;
>>> -
>>> ocfs2_free_dir_lookup_result(&orphan_insert);
>>>
>>> - ocfs2_inode_unlock(dir, 1);
>>> - brelse(parent_di_bh);
>>> + if (!status) {
>>> + *new_inode = inode;
>>> + *dir_bh = parent_di_bh;
>>> + } else {
>>> + ocfs2_inode_unlock(dir, 1);
>>> + brelse(parent_di_bh);
>>> + }
>>> +
>>> return status;
>>> }
>>>
>>> @@ -2760,11 +2764,11 @@ int ocfs2_del_inode_from_orphan(struct ocfs2_super *osb,
>>> }
>>>
>>> int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> + struct buffer_head *dir_bh,
>>> struct inode *inode,
>>> struct dentry *dentry)
>>> {
>>> int status = 0;
>>> - struct buffer_head *parent_di_bh = NULL;
>>> handle_t *handle = NULL;
>>> struct ocfs2_super *osb = OCFS2_SB(dir->i_sb);
>>> struct ocfs2_dinode *dir_di, *di;
>>> @@ -2778,14 +2782,7 @@ int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> (unsigned long long)OCFS2_I(dir)->ip_blkno,
>>> (unsigned long long)OCFS2_I(inode)->ip_blkno);
>>>
>>> - status = ocfs2_inode_lock(dir, &parent_di_bh, 1);
>>> - if (status < 0) {
>>> - if (status != -ENOENT)
>>> - mlog_errno(status);
>>> - return status;
>>> - }
>>> -
>>> - dir_di = (struct ocfs2_dinode *) parent_di_bh->b_data;
>>> + dir_di = (struct ocfs2_dinode *) dir_bh->b_data;
>>> if (!dir_di->i_links_count) {
>>> /* can't make a file in a deleted directory. */
>>> status = -ENOENT;
>>> @@ -2798,7 +2795,7 @@ int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> goto leave;
>>>
>>> /* get a spot inside the dir. */
>>> - status = ocfs2_prepare_dir_for_insert(osb, dir, parent_di_bh,
>>> + status = ocfs2_prepare_dir_for_insert(osb, dir, dir_bh,
>>> dentry->d_name.name,
>>> dentry->d_name.len, &lookup);
>>> if (status < 0) {
>>> @@ -2862,7 +2859,7 @@ int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> ocfs2_journal_dirty(handle, di_bh);
>>>
>>> status = ocfs2_add_entry(handle, dentry, inode,
>>> - OCFS2_I(inode)->ip_blkno, parent_di_bh,
>>> + OCFS2_I(inode)->ip_blkno, dir_bh,
>>> &lookup);
>>> if (status < 0) {
>>> mlog_errno(status);
>>> @@ -2886,10 +2883,7 @@ int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> iput(orphan_dir_inode);
>>> leave:
>>>
>>> - ocfs2_inode_unlock(dir, 1);
>>> -
>>> brelse(di_bh);
>>> - brelse(parent_di_bh);
>>> brelse(orphan_dir_bh);
>>>
>>> ocfs2_free_dir_lookup_result(&lookup);
>>> diff --git a/fs/ocfs2/namei.h b/fs/ocfs2/namei.h
>>> index 9cc891eb874e..03a2c526e2c1 100644
>>> --- a/fs/ocfs2/namei.h
>>> +++ b/fs/ocfs2/namei.h
>>> @@ -24,6 +24,7 @@ int ocfs2_orphan_del(struct ocfs2_super *osb,
>>> struct buffer_head *orphan_dir_bh,
>>> bool dio);
>>> int ocfs2_create_inode_in_orphan(struct inode *dir,
>>> + struct buffer_head **dir_bh,
>>> int mode,
>>> struct inode **new_inode);
>>> int ocfs2_add_inode_to_orphan(struct ocfs2_super *osb,
>>> @@ -32,6 +33,7 @@ int ocfs2_del_inode_from_orphan(struct ocfs2_super *osb,
>>> struct inode *inode, struct buffer_head *di_bh,
>>> int update_isize, loff_t end);
>>> int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
>>> + struct buffer_head *dir_bh,
>>> struct inode *new_inode,
>>> struct dentry *new_dentry);
>>>
>>> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
>>> index 7f6355cbb587..a9a0c7c37e8e 100644
>>> --- a/fs/ocfs2/refcounttree.c
>>> +++ b/fs/ocfs2/refcounttree.c
>>> @@ -4250,7 +4250,7 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>>> {
>>> int error, had_lock;
>>> struct inode *inode = d_inode(old_dentry);
>>> - struct buffer_head *old_bh = NULL;
>>> + struct buffer_head *old_bh = NULL, *dir_bh = NULL;
>>> struct inode *new_orphan_inode = NULL;
>>> struct ocfs2_lock_holder oh;
>>>
>>> @@ -4258,7 +4258,7 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>>> return -EOPNOTSUPP;
>>>
>>>
>>> - error = ocfs2_create_inode_in_orphan(dir, inode->i_mode,
>>> + error = ocfs2_create_inode_in_orphan(dir, &dir_bh, inode->i_mode,
>>> &new_orphan_inode);
>>> if (error) {
>>> mlog_errno(error);
>>> @@ -4304,13 +4304,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>>>
>>> /* If the security isn't preserved, we need to re-initialize them. */
>>> if (!preserve) {
>>> - error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
>>> + error = ocfs2_init_security_and_acl(dir, dir_bh,
>>> + new_orphan_inode,
>>> &new_dentry->d_name);
>>> if (error)
>>> mlog_errno(error);
>>> }
>>> if (!error) {
>>> - error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
>>> + error = ocfs2_mv_orphaned_inode_to_new(dir, dir_bh,
>>> + new_orphan_inode,
>>> new_dentry);
>>> if (error)
>>> mlog_errno(error);
>>> @@ -4328,6 +4330,11 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>>> iput(new_orphan_inode);
>>> }
>>>
>>> + if (dir_bh) {
>>> + ocfs2_inode_unlock(dir, 1);
>>> + brelse(dir_bh);
>>> + }
>>> +
>>> return error;
>>> }
>>>
>>> diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
>>> index dd784eb0cd7c..3f23e3a5018c 100644
>>> --- a/fs/ocfs2/xattr.c
>>> +++ b/fs/ocfs2/xattr.c
>>> @@ -7203,16 +7203,13 @@ int ocfs2_reflink_xattrs(struct inode *old_inode,
>>> /*
>>> * Initialize security and acl for a already created inode.
>>> * Used for reflink a non-preserve-security file.
>>> - *
>>> - * It uses common api like ocfs2_xattr_set, so the caller
>>> - * must not hold any lock expect i_mutex.
>>> */
>>> int ocfs2_init_security_and_acl(struct inode *dir,
>>> + struct buffer_head *dir_bh,
>>> struct inode *inode,
>>> const struct qstr *qstr)
>>> {
>>> int ret = 0;
>>> - struct buffer_head *dir_bh = NULL;
>>>
>>> ret = ocfs2_init_security_get(inode, dir, qstr, NULL);
>>> if (ret) {
>>> @@ -7220,17 +7217,10 @@ int ocfs2_init_security_and_acl(struct inode *dir,
>>> goto leave;
>>> }
>>>
>>> - ret = ocfs2_inode_lock(dir, &dir_bh, 0);
>>> - if (ret) {
>>> - mlog_errno(ret);
>>> - goto leave;
>>> - }
>>> ret = ocfs2_init_acl(NULL, inode, dir, NULL, dir_bh, NULL, NULL);
>>> if (ret)
>>> mlog_errno(ret);
>>>
>>> - ocfs2_inode_unlock(dir, 0);
>>> - brelse(dir_bh);
>>> leave:
>>> return ret;
>>> }
>>> diff --git a/fs/ocfs2/xattr.h b/fs/ocfs2/xattr.h
>>> index 00308b57f64f..b27fd8ba0019 100644
>>> --- a/fs/ocfs2/xattr.h
>>> +++ b/fs/ocfs2/xattr.h
>>> @@ -83,6 +83,7 @@ int ocfs2_reflink_xattrs(struct inode *old_inode,
>>> struct buffer_head *new_bh,
>>> bool preserve_security);
>>> int ocfs2_init_security_and_acl(struct inode *dir,
>>> + struct buffer_head *dir_bh,
>>> struct inode *inode,
>>> const struct qstr *qstr);
>>> #endif /* OCFS2_XATTR_H */
>>> --
>>> 2.21.0
>>>
>>>