Re: [PATCH v3 09/12] ext4: introduce mext_move_extent()

From: Jan Kara
Date: Fri Oct 10 2025 - 09:39:16 EST


On Fri 10-10-25 18:33:23, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> When moving extents, the current move_extent_per_page() process can only
> move extents of length PAGE_SIZE at a time, which is highly inefficient,
> especially when the fragmentation of the file is not particularly
> severe, this will result in a large number of unnecessary extent split
> and merge operations. Moreover, since the ext4 file system now supports
> large folios, using PAGE_SIZE as the processing unit is no longer
> practical.
>
> Therefore, introduce a new move extents method, mext_move_extent(). It
> moves one extent of the origin inode at a time, but not exceeding the
> size of a folio. The parameters for the move are passed through the new
> mext_data data structure, which includes the origin inode, donor inode,
> the mapping extent of the origin inode to be moved, and the starting
> offset of the donor inode.
>
> The move process is similar to move_extent_per_page() and can be
> categorized into three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and
> MEXT_COPY_DATA. MEXT_SKIP_EXTENT indicates that the corresponding area
> of the donor file is a hole, meaning no actual space is allocated, so
> the move is skipped. MEXT_MOVE_EXTENT indicates that the corresponding
> areas of both the origin and donor files are unwritten, so no data needs
> to be copied; only the extents are swapped. MEXT_COPY_DATA indicates
> that the corresponding areas of both the origin and donor files contain
> data, so data must be copied. The data copying is performed in three
> steps: first, the data from the original location is read into the page
> cache; then, the extents are swapped, and the page cache is rebuilt to
> reflect the index of the physical blocks; finally, the dirty page cache
> is marked and written back to ensure that the data is written to disk
> before the metadata is persisted.
>
> One important point to note is that the folio lock and i_data_sem are
> held only during the moving process. Therefore, before moving an extent,
> it is necessary to check whether the sequence cookie of the area to be
> moved has changed while holding the folio lock. If a change is detected,
> it indicates that concurrent write-back operations may have occurred
> during this period, and the type of the extent to be moved can no longer
> be considered reliable. For example, it may have changed from unwritten
> to written. In such cases, return -ESTALE, and the calling function
> should reacquire the move extent of the original file and retry the
> movement.
>
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>

...

> +static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
> +{
> + struct inode *orig_inode = mext->orig_inode;
> + struct inode *donor_inode = mext->donor_inode;
> + struct ext4_map_blocks *orig_map = &mext->orig_map;
> + unsigned int blkbits = orig_inode->i_blkbits;
> + struct folio *folio[2] = {NULL, NULL};
> + loff_t from, length;
> + enum mext_move_type move_type = 0;
> + handle_t *handle;
> + u64 r_len = 0;
> + unsigned int credits;
> + int ret, ret2;
> +
> + *m_len = 0;
> + credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
> + handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = mext_move_begin(mext, folio, &move_type);
> + if (ret)
> + goto stop_handle;
> +
> + if (move_type == MEXT_SKIP_EXTENT)
> + goto unlock;
> +
> + /*
> + * Copy the data. First, read the original inode data into the page
> + * cache. Then, release the existing mapping relationships and swap
> + * the extent. Finally, re-establish the new mapping relationships
> + * and dirty the page cache.
> + */
> + if (move_type == MEXT_COPY_DATA) {
> + from = offset_in_folio(folio[0],
> + ((loff_t)orig_map->m_lblk) << blkbits);
> + length = ((loff_t)orig_map->m_len) << blkbits;
> +
> + ret = mext_folio_mkuptodate(folio[0], from, from + length);
> + if (ret)
> + goto unlock;
> + }
> +
> + if (!filemap_release_folio(folio[0], 0) ||
> + !filemap_release_folio(folio[1], 0)) {
> + ret = -EBUSY;
> + goto unlock;
> + }
> +
> + /* Move extent */
> + ext4_double_down_write_data_sem(orig_inode, donor_inode);
> + *m_len = ext4_swap_extents(handle, orig_inode, donor_inode,
> + orig_map->m_lblk, mext->donor_lblk,
> + orig_map->m_len, 1, &ret);
> + ext4_double_up_write_data_sem(orig_inode, donor_inode);
> +
> + /* A short-length swap cannot occur after a successful swap extent. */
> + if (WARN_ON_ONCE(!ret && (*m_len != orig_map->m_len)))
> + ret = -EIO;
> +
> + if (!(*m_len) || (move_type == MEXT_MOVE_EXTENT))
> + goto unlock;
> +
> + /* Copy data */
> + length = (*m_len) << blkbits;
> + ret = mext_folio_mkwrite(orig_inode, folio[0], from, from + length);
> + if (ret)
> + goto repair_branches;

I think you need to be careful here and below to not overwrite 'ret' if it
is != 0. So something like:

ret2 = mext_folio_mkwrite(..)
if (ret2) {
if (!ret)
ret = ret2;
goto repair_branches;
}

and something similar below. Otherwise the patch looks good to me.

Honza

> + /*
> + * Even in case of data=writeback it is reasonable to pin
> + * inode to transaction, to prevent unexpected data loss.
> + */
> + ret = ext4_jbd2_inode_add_write(handle, orig_inode,
> + ((loff_t)orig_map->m_lblk) << blkbits, length);
> +unlock:
> + mext_folio_double_unlock(folio);
> +stop_handle:
> + ext4_journal_stop(handle);
> + return ret;
> +
> +repair_branches:
> + r_len = ext4_swap_extents(handle, donor_inode, orig_inode,
> + mext->donor_lblk, orig_map->m_lblk,
> + *m_len, 0, &ret2);
> + if (ret2 || r_len != *m_len) {
> + ext4_error_inode_block(orig_inode, (sector_t)(orig_map->m_lblk),
> + EIO, "Unable to copy data block, data will be lost!");
> + ret = -EIO;
> + }
> + *m_len = 0;
> + goto unlock;
> +}
> +
> /**
> * move_extent_per_page - Move extent data per page
> *
> --
> 2.46.1
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR