Re: [6.2][regression] after commit 947a629988f191807d2d22ba63ae18259bb645c5 btrfs volume periodical forced switch to readonly after a lot of disk writes

From: Qu Wenruo
Date: Tue Dec 27 2022 - 20:08:50 EST




On 2022/12/27 21:11, Mikhail Gavrilov wrote:
On Tue, Dec 27, 2022 at 4:03 PM Qu Wenruo <wqu@xxxxxxxx> wrote:

I have a similar laptop (G14), only GPU is different (RTX3060), and I
failed to reproduce this so far...

My gcc is only a small version behind (12.2.0).

Thus none of the hardware seems suspicious at all...

Anyway I have attached my last struggle for the weird problem.
For now, I have no idea why this can even happen...

The new Kernel log is attached.
This time, the main difference was that the file system did not
immediately switch to readonly.
The Steam client stopped a couple of times with a write error, but
after pressing the resume button, it resumed downloading. For the
third or fourth time refused to download.

I'm a total idiot.

From the very first dmesg with calltrack, it already shows the submit_one_bio() is called from submit_extent_page(), which means cases cross stripe boundary, and has no parent_check populated at all.

And since you're using RAID0 on two NVMEs, it matches the symptom, while most tests done here are using single device (DUP and SINGLE), thus no stripe boundary cases at all.
(In fact it should still be possible to trigger on SINGLE, but way too hard to trigger)

With proper root cause found, this version should mostly handle the regression correctly.

This version should mostly be the formal one I'd later send to the mailing list.

I can not thank you more for all the testing you have provided, it not only pinned down the bug, but also proves I'm a total idiot...

Thanks,
QuFrom cc05d22c04507440c644384d3b68def65653b0de Mon Sep 17 00:00:00 2001
Message-Id: <cc05d22c04507440c644384d3b68def65653b0de.1672189568.git.wqu@xxxxxxxx>
From: Qu Wenruo <wqu@xxxxxxxx>
Date: Wed, 28 Dec 2022 08:44:18 +0800
Subject: [PATCH] btrfs: fix the false alert on bad tree level

[BUG]
There is a bug report that on a RAID0 nvme btrfs system, under heavy
write load the fs can flip RO randomly.

With extra debug, it shows some tree blocks failed to pass its level
checks, and if that happens at critical path of a transaction, we abort
the transacation:

BTRFS error (device nvme0n1p3): level verify failed on logical 5446121209856 mirror 1 wanted 0 found 1
BTRFS error (device nvme0n1p3: state A): Transaction aborted (error -5)
BTRFS: error (device nvme0n1p3: state A) in btrfs_finish_ordered_io:3343: errno=-5 IO failure
BTRFS info (device nvme0n1p3: state EA): forced readonly

[CAUSE]
The reporter has already bisected to commit 947a629988f1 ("btrfs: move
tree block parentness check into validate_extent_buffer()").

And with extra debug, it shows we can have btrfs_tree_parent_check
filled with all zero in the following call trace:

<TASK>
submit_one_bio+0xd4/0xe0
submit_extent_page+0x142/0x550
read_extent_buffer_pages+0x584/0x9c0
? __pfx_end_bio_extent_readpage+0x10/0x10
? folio_unlock+0x1d/0x50
btrfs_read_extent_buffer+0x98/0x150
read_tree_block+0x43/0xa0
read_block_for_search+0x266/0x370
btrfs_search_slot+0x351/0xd30
? lock_is_held_type+0xe8/0x140
btrfs_lookup_csum+0x63/0x150
btrfs_csum_file_blocks+0x197/0x6c0
? sched_clock_cpu+0x9f/0xc0
? lock_release+0x14b/0x440
? _raw_read_unlock+0x29/0x50
btrfs_finish_ordered_io+0x441/0x860
btrfs_work_helper+0xfe/0x400
? lock_is_held_type+0xe8/0x140
process_one_work+0x294/0x5b0
worker_thread+0x4f/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xf5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>

Currently we only copy the btrfs_tree_parent_check structure into bbio
at read_extent_buffer_pages() after we have assembled the bbio.

But as shown in the above call trace, submit_extent_page() itself can
already submit the bbio, leaving the bbio->parent_check uninitialized,
and cause the false alert.

[FIX]
Instead of copying @check into bbio after bbio is assembled, we pass
@check in btrfs_bio_ctrl::parent_check, and copy the content of
parent_check in submit_one_bio() for metadata read.

By this, we should be able to pass the needed info for metadata endio
verification, and fix the false alert.

Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
Link: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
Fixes: 947a629988f1 ("btrfs: move tree block parentness check into validate_extent_buffer()")
Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
fs/btrfs/extent_io.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 83dd3aa59663..b11332482d57 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -103,6 +103,16 @@ struct btrfs_bio_ctrl {
u32 len_to_oe_boundary;
btrfs_bio_end_io_t end_io_func;

+ /*
+ * This is for metadata read, to provide the extra needed
+ * verification info.
+ * This has to be provided for submit_one_bio(), as submit_one_bio()
+ * can submit a bio if it ends at stripe boundary.
+ * If no such parent_check provided, the metadata can hit false alert
+ * at endio time.
+ */
+ struct btrfs_tree_parent_check *parent_check;
+
/*
* Tell writepage not to lock the state bits for this range, it still
* does the unlocking.
@@ -133,13 +143,24 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)

btrfs_bio(bio)->file_offset = page_offset(bv->bv_page) + bv->bv_offset;

- if (!is_data_inode(&inode->vfs_inode))
+ if (!is_data_inode(&inode->vfs_inode)) {
+ if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
+ /*
+ * For metadata read, we should have the parent_check,
+ * and copy it to bbio for metadata verification.
+ */
+ ASSERT(bio_ctrl->parent_check);
+ memcpy(&btrfs_bio(bio)->parent_check,
+ bio_ctrl->parent_check,
+ sizeof(struct btrfs_tree_parent_check));
+ }
btrfs_submit_metadata_bio(inode, bio, mirror_num);
- else if (btrfs_op(bio) == BTRFS_MAP_WRITE)
+ } else if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
btrfs_submit_data_write_bio(inode, bio, mirror_num);
- else
+ } else {
btrfs_submit_data_read_bio(inode, bio, mirror_num,
bio_ctrl->compress_type);
+ }

/* The bio is owned by the end_io handler now */
bio_ctrl->bio = NULL;
@@ -4829,6 +4850,7 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
struct extent_state *cached_state = NULL;
struct btrfs_bio_ctrl bio_ctrl = {
.mirror_num = mirror_num,
+ .parent_check = check,
};
int ret = 0;

@@ -4878,7 +4900,6 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
*/
atomic_dec(&eb->io_pages);
}
- memcpy(&btrfs_bio(bio_ctrl.bio)->parent_check, check, sizeof(*check));
submit_one_bio(&bio_ctrl);
if (ret || wait != WAIT_COMPLETE) {
free_extent_state(cached_state);
@@ -4905,6 +4926,7 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
unsigned long num_reads = 0;
struct btrfs_bio_ctrl bio_ctrl = {
.mirror_num = mirror_num,
+ .parent_check = check,
};

if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
@@ -4996,7 +5018,6 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
}
}

- memcpy(&btrfs_bio(bio_ctrl.bio)->parent_check, check, sizeof(*check));
submit_one_bio(&bio_ctrl);

if (ret || wait != WAIT_COMPLETE)
--
2.39.0