Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

From: Ojaswin Mujoo
Date: Tue Jan 17 2023 - 05:33:59 EST


On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > Since this covers the special case we discussed above, we will always
> > un-delete the PA when we encounter the special case and we can then
> > adjust for overlap and traverse the PA rbtree without any issues.
> >
> > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>

Hi Jan,
Thanks for the review, sharing some of my thoughts below.

>
> So I find this putting back of already deleted inode PA very fragile. For
> example in current code I suspect you've missed a case in ext4_mb_put_pa()
> which can mark inode PA (so it can then be spotted by
> ext4_mb_pa_adjust_overlap() and marked as in use again) but
> ext4_mb_put_pa() still goes on and destroys the PA.

The 2 code paths that clash here are:

ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_new_blocks()
ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()

Since these are the only code paths from which these 2 functions are
called, for a given inode, access will always be serialized by the upper
level ei->i_data_sem, which is always taken when writing data blocks
using ext4_mb_new_block().

>From my understanding of the code, I feel only
ext4_mb_discard_group_preallocations() can race against other functions
that are modifying the PA rbtree since it does not take any inode locks.

That being said, I do understand your concerns regarding the solution,
however I'm willing to work with the community to ensure our
implementation of this undelete feature is as robust as possible. Along
with fixing the bug reported here [1], I believe that it is also a good
optimization to have especially when the disk is near full and we are
seeing a lot of group discards going on.

Also, in case the deleted PA completely lies inside our new range, it is
much better to just undelete and use it rather than deleting the
existing PA and reallocating the range again. I think the advantage
would be even bigger in ext4_mb_use_preallocated() function where we can
just undelete and use the PA and skip the entire allocation, incase original
range lies in a deleted PA.

>
> So I'd really love to stay with the invariant that once PA is marked as
> deleted, it can never go alive again. Since finding such deleted PA that is
> overlapping our new range should be really rare, cannot we just make
> ext4_mb_pa_adjust_overlap() rb_erase() this deleted PA and restart the
> adjustment search? Since rb_erase() is not safe to be called twice, we'd
> have to record somewhere in the PA that the erasure has already happened
> (probably we could have two flags in 'deleted' field - deleted from group
> list, deleted from object (lg_list/inode_node)) but that is still much more
> robust...
Right, this is an apporach we did consider but we felt it would be a
better to take this opportunity to make the above optimization. Would
love to hear your thoughts on this.

Thanks,
Ojaswin

>
> Honza
>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR