[PATCH 2/2] md: fix deadlock while suspending RAID array

From: Simon Guinot
Date: Tue Mar 11 2014 - 15:13:23 EST


Sometimes a deadlock happens while migrating a RAID array level, using
the mdadm --grow command. In the following example, an ext4 filesystem
is installed over a RAID1 array and mdadm is used to transform this
array into a RAID5 one. Here are the observed backtraces for the locked
tasks:

jbd2/dm-0-8 D c0478384 0 9100 2 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0157c68>] (jbd2_journal_commit_transaction+0x1b0/0x132c)
[<c0157c68>] (jbd2_journal_commit_transaction+0x1b0/0x132c) from [<c015b5f4>] (kjournald2+0x9c/0x200)
[<c015b5f4>] (kjournald2+0x9c/0x200) from [<c003558c>] (kthread+0xa4/0xb0)
[<c003558c>] (kthread+0xa4/0xb0) from [<c000df18>] (ret_from_fork+0x14/0x3c)
ext4lazyinit D c0478384 0 9113 2 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0364ed0>] (md_write_start+0xd8/0x194)
[<c0364ed0>] (md_write_start+0xd8/0x194) from [<bf004c80>] (make_request+0x3c/0xc5c [raid1])
[<bf004c80>] (make_request+0x3c/0xc5c [raid1]) from [<c0363784>] (md_make_request+0xe4/0x1f8)
[<c0363784>] (md_make_request+0xe4/0x1f8) from [<c025f544>] (generic_make_request+0xa8/0xc8)
[<c025f544>] (generic_make_request+0xa8/0xc8) from [<c025f5e4>] (submit_bio+0x80/0x12c)
[<c025f5e4>] (submit_bio+0x80/0x12c) from [<c0265648>] (__blkdev_issue_zeroout+0x134/0x1a0)
[<c0265648>] (__blkdev_issue_zeroout+0x134/0x1a0) from [<c0265748>] (blkdev_issue_zeroout+0x94/0xa0)
[<c0265748>] (blkdev_issue_zeroout+0x94/0xa0) from [<c011a3e8>] (ext4_init_inode_table+0x178/0x2cc)
[<c011a3e8>] (ext4_init_inode_table+0x178/0x2cc) from [<c0129fac>] (ext4_lazyinit_thread+0xe8/0x288)
[<c0129fac>] (ext4_lazyinit_thread+0xe8/0x288) from [<c003558c>] (kthread+0xa4/0xb0)
[<c003558c>] (kthread+0xa4/0xb0) from [<c000df18>] (ret_from_fork+0x14/0x3c)
mdadm D c0478384 0 10163 9465 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0362edc>] (mddev_suspend+0x68/0xc0)
[<c0362edc>] (mddev_suspend+0x68/0xc0) from [<c0363080>] (level_store+0x14c/0x59c)
[<c0363080>] (level_store+0x14c/0x59c) from [<c03665ac>] (md_attr_store+0xac/0xdc)
[<c03665ac>] (md_attr_store+0xac/0xdc) from [<c00eee38>] (sysfs_write_file+0x100/0x168)
[<c00eee38>] (sysfs_write_file+0x100/0x168) from [<c0098598>] (vfs_write+0xb8/0x184)
[<c0098598>] (vfs_write+0xb8/0x184) from [<c009893c>] (SyS_write+0x40/0x6c)
[<c009893c>] (SyS_write+0x40/0x6c) from [<c000de80>] (ret_fast_syscall+0x0/0x30)

This deadlock can be reproduced on different architecture (ARM and x86)
and also with different Linux kernel versions: 3.14-rc and 3.10 stable.

The problem comes from the mddev_suspend() function which don't allow
mddev->thread to complete the pending I/Os (mddev->active_io) if any:

1. mdadm holds mddev->reconfig_mutex before running mddev_suspend().
If a write I/O is submitted while mdadm holds the mutex and when the
RAID array is still not suspended, then mddev->thread is not able to
complete the I/O: The superblock can't be updated because
mddev->reconfig_mutex is not available. Note that having a write I/O
over a "not suspended yet" RAID array is not a marginal scenario:
To load a new RAID personality, level_store() calls request_module()
which is allowed to schedule. Moreover on a SMP or a preemptible
kernel, the odds are probably even greater.

2. In a same way, mddev_suspend() sets the mddev->suspended flag. Again
this may prevent mddev->thread to complete some pending I/Os when
a superblock update is needed: md_check_recovery, used by the RAID
threads, does nothing but exits when the mddev->suspended flag is
set. As a consequence the superblock is never updated.

This patch solves this issues by ensuring there is no pending active
I/Os before suspending effectively a RAID array.

Signed-off-by: Simon Guinot <simon.guinot@xxxxxxxxxxxx>
Tested-by: RÃmi RÃrolle <remi.rerolle@xxxxxxxxxxx>
---
drivers/md/md.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fb4296adae80..ea3e95d1972b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -375,9 +375,22 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
void mddev_suspend(struct mddev *mddev)
{
BUG_ON(mddev->suspended);
- mddev->suspended = 1;
- synchronize_rcu();
- wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+
+ for (;;) {
+ mddev->suspended = 1;
+ synchronize_rcu();
+ if (atomic_read(&mddev->active_io) == 0)
+ break;
+ mddev->suspended = 0;
+ synchronize_rcu();
+ /*
+ * Note that mddev_unlock is also used to wake up mddev->thread.
+ * This allows to complete the pending mddev->active_io.
+ */
+ mddev_unlock(mddev);
+ wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+ mddev_lock_nointr(mddev);
+ }
mddev->pers->quiesce(mddev, 1);

del_timer_sync(&mddev->safemode_timer);
--
1.8.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/