Re: [6.2][regression] after commit 947a629988f191807d2d22ba63ae18259bb645c5 btrfs volume periodical forced switch to readonly after a lot of disk writes

From: Qu Wenruo
Date: Mon Dec 26 2022 - 03:47:37 EST




On 2022/12/26 16:15, Mikhail Gavrilov wrote:
On Mon, Dec 26, 2022 at 8:29 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:


OK, indeed a level mismatch.

The remaining lines show that we're failing in
do_free_extent_accounting(), at its btrfs_del_csums() call.

Inside btrfs_del_csums(), everything we do is a regular btree
operation, so the tree level check should work without problems.

Thus it seems to be a corrupted csum tree.

Do I need to debug anything else to understand the cause of the error?
Thanks.

Given the check output, it's indeed a runtime error
(at least there is no corruption on your fs).

It may be that some call path is not properly initializing the level to check.
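
To illustrate that failure mode, here is a minimal userspace sketch, not kernel code. The names struct parent_check, struct tree_block and validate_level() are hypothetical stand-ins that only model the found_level vs. check->level comparison. It assumes the caller alone is responsible for filling in the expected level:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the parent check data (assumption, not the kernel struct). */
struct parent_check {
	uint8_t level;		/* level the caller expects the block to have */
};

/* Hypothetical stand-in for a tree block that was just read from disk. */
struct tree_block {
	uint64_t start;		/* logical bytenr of the block */
	uint8_t found_level;	/* level stored in the block header */
};

/* Models the level comparison: any mismatch is reported and becomes -EIO. */
static int validate_level(const struct tree_block *eb,
			  const struct parent_check *check)
{
	assert(eb && check);

	if (eb->found_level != check->level) {
		fprintf(stderr,
			"level verify failed on logical %llu wanted %u found %u\n",
			(unsigned long long)eb->start,
			(unsigned)check->level, (unsigned)eb->found_level);
		return -EIO;
	}
	return 0;
}
```

In this model, a caller that passes a stale or never-initialized level gets -EIO even though the block itself is fine on disk, which would be consistent with a clean "btrfs check" result.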

Here is the new debug patch.
It should be applied without any previous debug patch.

Thanks,
Qu



Could you please run "btrfs check --readonly" from a liveCD?
There are tons of possible false alerts if run on a RW mounted fs.


# btrfs check --readonly /dev/nvme0n1p3
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p3
UUID: 40e0b5d2-df54-46e0-b6f4-2f868296271d
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 6828416307200 bytes used, no error found
total csum bytes: 6651838248
total tree bytes: 16378380288
total fs tree bytes: 7483179008
total extent tree bytes: 1228210176
btree space waste bytes: 2413299694
file data blocks allocated: 6899999100928
referenced 7488299450368
[root@localhost-live ~]#

From the liveCD everything looks OK (no errors found).

From c9932d40594da6065125b76b55bd9cea1fabc812 Mon Sep 17 00:00:00 2001
Message-Id: <c9932d40594da6065125b76b55bd9cea1fabc812.1672044392.git.wqu@xxxxxxxx>
From: Qu Wenruo <wqu@xxxxxxxx>
Date: Mon, 26 Dec 2022 16:44:08 +0800
Subject: [PATCH] btrfs: add extra debug for level mismatch

Currently I assume there is some race or uninitialized value for
check::level.

The extra output is added at two locations:

- validate_extent_buffer()
  Output the error message for read error and the members of check.

- read_extent_buffer_pages()
  This will dump the stack for us to catch the offender.

Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
fs/btrfs/disk-io.c | 15 +++++++++++++--
fs/btrfs/extent_io.c | 12 +++++++++++-
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f8b5955f003f..62e6ad909b19 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -530,6 +530,10 @@ static int validate_extent_buffer(struct extent_buffer *eb,
}

if (found_level != check->level) {
+ btrfs_err(eb->fs_info,
+"level verify failed on logical %llu mirror %u wanted %u found %u",
+ eb->start, eb->read_mirror, check->level,
+ found_level);
ret = -EIO;
goto out;
}
@@ -581,13 +585,20 @@ static int validate_extent_buffer(struct extent_buffer *eb,
if (found_level > 0 && btrfs_check_node(eb))
ret = -EIO;

+out:
if (!ret)
set_extent_buffer_uptodate(eb);
- else
+ else {
btrfs_err(fs_info,
"read time tree block corruption detected on logical %llu mirror %u",
eb->start, eb->read_mirror);
-out:
+ btrfs_err(eb->fs_info,
+"check owner_root=%llu transid=%llu first_key=(%llu %u %llu) has_first_key=%d level=%u",
+ check->owner_root,
+ check->transid, check->first_key.objectid,
+ check->first_key.type, check->first_key.offset,
+ check->has_first_key, check->level);
+ }
return ret;
}

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 83dd3aa59663..5f267345ef94 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5005,8 +5005,18 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
wait_on_page_locked(page);
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
ret = -EIO;
+ btrfs_err(eb->fs_info,
+"read failed, check owner_root=%llu transid=%llu has_first_key=%d first_key=(%llu %u %llu) level=%u",
+ check->owner_root, check->transid,
+ check->has_first_key,
+ check->first_key.objectid,
+ check->first_key.type,
+ check->first_key.offset,
+ check->level);
+ dump_stack();
+ }
}

return ret;
--
2.39.0