Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails

From: Yu Kuai
Date: Thu Aug 14 2025 - 21:27:39 EST

In reply to: Kenta Akagi: "Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails"
Next in thread: Kenta Akagi: "Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails"

Hi,

On 2025/08/14 23:54, Kenta Akagi wrote:

On 2025/08/13 9:59, Yu Kuai wrote:

Hi,

On 2025/08/12 17:01, Kenta Akagi wrote:

It is not intended for the array to fail when a metadata write with
MD_FAILFAST fails.
After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
when md_error is called on the last device in RAID1/10,
the MD_BROKEN flag is set on the array.
Because of this, a failfast metadata write failure will
put the array into the "broken" state.

If rdev is not Faulty even after calling md_error,
the rdev is the last device, and there is nothing except
MD_BROKEN that prevents writes to the array.
Therefore, by clearing MD_BROKEN, the array will not become
"broken" after a failfast metadata write failure.

I don't understand this. I think MD_BROKEN is expected: the last
rdev had an IO error while updating metadata, so the array is now broken
and you can only read it afterwards. Allowing this broken array to be used
read-write might cause more severe problems such as data loss.

Thank you for reviewing.

I think that, only when the bio has the MD_FAILFAST flag,
a metadata write failure to the last rdev should not make the
array broken at that point.

This is because a metadata write with MD_FAILFAST is retried after
failure as follows:

1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.

2. In md_super_wait, which is called by the function that
executed md_super_write and waits for completion,
-EAGAIN is returned because MD_SB_NEED_REWRITE is set.

3. The caller of md_super_wait (such as md_update_sb)
receives a negative return value and then retries md_super_write.

4. The md_super_write function, which is called to perform
the same metadata write, issues a write bio
without MD_FAILFAST this time, because the rdev has LastDev flag.

When a bio without MD_FAILFAST fails in super_written,
the array is truly broken, and MD_BROKEN should be set.
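
The retry sequence in steps 1-4 can be sketched as a small userspace model.
This is only an illustration, not kernel code: every toy_* name, the
link_down and dev_dead fields, and the clear_on_failfast switch are
hypothetical, and the device model (a failfast write fails while the link
is down, a non-failfast write waits out the reconnect and fails only when
the device is really gone) is an assumption taken from the description in
this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model of the retry flow; this is NOT kernel code. */
struct toy_array {
	bool sb_need_rewrite;	/* stands in for MD_SB_NEED_REWRITE */
	bool last_dev;		/* stands in for the LastDev flag on the rdev */
	bool broken;		/* stands in for MD_BROKEN */
	bool link_down;		/* assumption: transient outage, failfast I/O fails */
	bool dev_dead;		/* assumption: permanent failure, all I/O fails */
	bool clear_on_failfast;	/* true = behave like the proposed patch */
};

/* Completion handler, loosely modelled on super_written(). */
static void toy_super_written(struct toy_array *a, bool ok, bool failfast)
{
	if (ok)
		return;
	/* Models md_error() on the last device: the array gets flagged broken. */
	a->broken = true;
	if (failfast) {
		if (a->clear_on_failfast)
			a->broken = false;	/* the one-line change under discussion */
		a->sb_need_rewrite = true;	/* step 1 */
		a->last_dev = true;
	}
}

/* Issue one metadata write, loosely modelled on md_super_write(). */
static void toy_super_write(struct toy_array *a)
{
	bool failfast = !a->last_dev;	/* step 4: no failfast once LastDev is set */
	/* A non-failfast write is assumed to wait out a transient outage. */
	bool ok = !a->dev_dead && !(failfast && a->link_down);

	toy_super_written(a, ok, failfast);
}

/* Retry loop, loosely modelled on md_update_sb() plus md_super_wait(). */
static bool toy_update_sb(struct toy_array *a)
{
	assert(a != NULL);
	do {
		a->sb_need_rewrite = false;
		toy_super_write(a);
	} while (a->sb_need_rewrite);	/* steps 2-3: -EAGAIN, so write again */
	return a->broken;
}

/* Run one scenario during an outage; returns whether the array ends up broken. */
bool toy_scenario(bool clear_on_failfast, bool dev_dead)
{
	struct toy_array a = {
		.link_down = true,
		.dev_dead = dev_dead,
		.clear_on_failfast = clear_on_failfast,
	};

	return toy_update_sb(&a);
}
```

In this model, toy_scenario(false, false) leaves the array broken even
though the non-failfast retry succeeds, toy_scenario(true, false) leaves it
usable, and toy_scenario(true, true) still ends up broken because the
retry without failfast fails as well.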

A failfast bio, for example in the case of nvme-tcp,
will fail immediately if the connection to the target is
lost for a few seconds and the device enters a reconnecting
state - even though it would recover if given a few seconds.
This behavior is exactly as intended by the design of failfast.

However, md treats super_write operations that fail with failfast as fatal.
For example, if an initiator - that is, a machine loading the md module -
loses all connections for a few seconds, the array becomes
broken and subsequent writes are no longer possible.
This is the issue I am currently facing, and which this patch aims to fix.

Should I add more context to the commit message? Please advise.

Yes, please explain in detail in commit message.

Thanks,
AKAGI

Thanks,
Kuai

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@xxxxxxx>
---
drivers/md/md.c | 1 +
drivers/md/md.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..3ec4abf02fa0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
          md_error(mddev, rdev);
          if (!test_bit(Faulty, &rdev->flags)
              && (bio->bi_opf & MD_FAILFAST)) {
+            clear_bit(MD_BROKEN, &mddev->flags);

And I feel a better way is to set MD_BROKEN only if the last rdev
failed; setting it in the middle and then clearing it is weird.

Thanks,
Kuai

              set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
              set_bit(LastDev, &rdev->flags);
          }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..2f87bcc5d834 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -332,7 +332,7 @@ struct md_cluster_operations;
   *                   resync lock, need to release the lock.
   * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
   *                calls to md_error() will never cause the array to
- *                become failed.
+ *                become failed while fail_last_dev is not set.
   * @MD_HAS_PPL: The raid array has PPL feature set.
   * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
   * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
