Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

From: Josef Bacik
Date: Wed Jun 17 2020 - 14:19:33 EST

Next message: Linus Torvalds: "Re: rename probe_kernel_* and probe_user_*"
Previous message: Steve Longerbeam: "Re: [PATCH] media: rcar-vin: Move media_device_register to async completion"
In reply to: Filipe Manana: "Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race"
Next in thread: Filipe Manana: "Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/17/20 2:11 PM, Filipe Manana wrote:

On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@xxxxxx> wrote:

On 17 Jun 2020, at 13:20, Filipe Manana wrote:

On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@xxxxxx> wrote:

---
fs/btrfs/extent_io.c | 45
++++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int
write_one_eb(struct extent_buffer *eb,
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
num_pages = num_extent_pages(eb);
atomic_set(&eb->io_pages, num_pages);
+ /*
+ * It is possible for releasepage to clear the TREE_REF bit
before we
+ * set io_pages. See check_buffer_tree_ref for a more
detailed comment.
+ */
+ check_buffer_tree_ref(eb);

This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?

This was Josefâs idea, but I really like the symmetry. You set
io_pages, you do the tree_ref dance. Everyone fiddling with the write
back bit right now correctly clears writeback after doing the atomic_dec
on io_pages, but the race is tiny and prone to getting exposed again by
shifting code around. Tree ref checks around io_pages are the most
reliable way to prevent this bug from coming back again later.

Ok, but that still doesn't answer my question.
Is there an actual race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we
also call lock_extent_buffer_for_io(),
which clears the dirty bit and sets the writeback bit on the eb while
holding its ref_locks spin_lock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the ref_locks spin_lock,
so at any time it should return true, as either the dirty or the
writeback bit is set.

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk
in the write path?
All it mentions is a race between reading and releasing pages, there's
nothing mentioned about races with writeback.

I think maybe we make that bit a separate patch, and leave the panic fix on it's own. I suggested this because I have lofty ideas of killing the refs_lock, and this would at least keep us consistent in our usage TREE_REF to save us from freeing stuff that may be currently under IO.

I'm super not happy with our reference handling coupled with releasepage, but these are the kind of hoops we're going to have to jump through until we have some better infrastructure in place to handle metadata. Thanks,

Josef

Next message: Linus Torvalds: "Re: rename probe_kernel_* and probe_user_*"
Previous message: Steve Longerbeam: "Re: [PATCH] media: rcar-vin: Move media_device_register to async completion"
In reply to: Filipe Manana: "Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race"
Next in thread: Filipe Manana: "Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]