Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE

From: Namjae Jeon
Date: Sun Aug 04 2013 - 04:29:17 EST

Next message: Greg KH: "Linux 3.0.89"
Previous message: Michal Hocko: "Re: [PATCH 02/23] memcg, thp: charge huge cache pages"
In reply to: Dave Chinner: "Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE"
Next in thread: Namjae Jeon: "Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

>
>> >> + if (cur) {
>> >> + if (!error)
>> >> + cur->bc_private.b.allocated = 0;
>> >
>> > Um, why?
>> Okay, will remove.
>
> I'm asking you to explain why you put it there. Don't remove code
> that might be necessary just because a hard question was asked....
We copied this code from xfs_bunmapi. And we realize that why it is
there because xfs_bunmapi may split the extents, which could require
block allocation for btree, but I think that is not necessary here
because updating starting offsets of extents would not reqire a btree
split and allocation.
>
>> >> @@ -845,6 +850,21 @@ xfs_file_fallocate(
>> >> if (error)
>> >> goto out_unlock;
>> >>
>> >> + if (mode & FALLOC_FL_COLLAPSE_RANGE) {
>> >> + error = -xfs_collapse_file_space(ip, offset + len, len);
>> >> + if (error || offset >= i_size_read(inode))
>> >> + goto out_unlock;
>> >> +
>> >> + /* Shrink size in case of FALLOC_FL_COLLAPSE_RANGE */
>> >> + if ((offset + len) > i_size_read(inode))
>> >> + new_size = offset;
>> >> + else
>> >> + new_size = i_size_read(inode) - len;
>> >> + error = -xfs_update_file_size(ip, new_size);
>> >> +
>> >> + goto out_unlock;
>> >> + }
>> >
>> > This needs to vector through xfs_change_file_space() - it already
>> > has code for doing offset/range validity checks, attaching
>> > appropriate dquots for accounting, post-op file size and timestamp
>> > updates, etc.
>> It already is going through xfs_change_file_space(). Check this=>
>
> No it isn't - this is calling xfs_collapse_file_space from
> xfs_file_fallocate(). That is not going through
> xfs_change_file_space().
>
> Oh, I see now, this hunk is *after* the xfs_change_file_space()
> call. That's nasty - it's a layering violation, pure and simple.
>
> No wonder I thought that no hole punching was done, it's triggered
> by a non-obvious flag set and so hidden three layers away from the
> xfs_collapse_file_space() call that acts on the hole that was
> punched. This code *must* go through the same code paths as the
> other fallocate operations so it is obvious how it interacts with
> everything.
>
>> >> +
>> >> +/*
>> >> + * xfs_update_file_size()
>> >> + * Just a simple disk size and time update
>> >> + */
>> >> +int
>> >> +xfs_update_file_size(
>> >> + struct xfs_inode *ip,
>> >> + loff_t newsize)
>> >
>> > This function should be nuked from orbit. I stopped looking at it
>> > when the bug count got past 5. If you use xfs_change_file_space,
>> > it's not necessary, either.
>> we are using xfs_change_file_space(). See my comment above. :)
>
> Yes, badly. See my comment above. :)
>
>> But, xfs_change_file_space does not change logical file size.
>> Neither can we use xfs_setattr, because it will truncate the
>> preallocated extents beyond EOF.
>
> And the problem with that is?
>
> Seriously, if you are chopping chunks out of a file and moving
> extents around in it, you aren't going to be writing to it while
> that is happening. For NLE workflows, this manipulation happens
> after the entire stream is written. If you collapse the range while
> the video is being written, you are going to end up with big
> chunks of zeroes in the file as you the write() has a file position
> way beyond the new EOF....
>
>> >> diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
>> >> index dc730ac..95b46e7 100644
>> >> --- a/fs/xfs/xfs_vnodeops.c
>> >> +++ b/fs/xfs/xfs_vnodeops.c
>> >> @@ -1868,3 +1868,65 @@ xfs_change_file_space(
>> >> xfs_trans_set_sync(tp);
>> >> return xfs_trans_commit(tp, 0);
>> >> }
>> >> +
>> >> +/*
>> >> + * xfs_collapse_file_space()
>> >> + * This implements the fallocate collapse range functionlaity.
>> >> + * It removes the hole from file by shifting all the extents which
>> >> + * lies beyond start block.
>> >> + */
>> >> +int
>> >> +xfs_collapse_file_space(
>> >> + xfs_inode_t *ip,
>> >> + loff_t start,
>> >> + loff_t shift)
>> >> +{
>> >> + int done = 0;
>> >> + xfs_mount_t *mp = ip->i_mount;
>> >> + uint resblks;
>> >> + xfs_trans_t *tp;
>> >> + int error = 0;
>> >> + xfs_extnum_t current_ext = 0;
>> >> + xfs_fileoff_t start_fsb = XFS_B_TO_FSB(mp, start);
>> >> + xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, shift);
>> >> + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
>> >> +
>> >> + while (!error && !done) {
>> >> + tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
>> >> + tp->t_flags |= XFS_TRANS_RESERVE;
>> >> + error = xfs_trans_reserve(tp, resblks, XFS_WRITE_LOG_RES(mp),
>> >> + 0, XFS_TRANS_PERM_LOG_RES,
>> >> + XFS_WRITE_LOG_COUNT);
>> >
>> > Why a permanent log reservation for a single, non-nested transaction?
>> XFS transaction log reservation code is becoming our major problem.
>> Could you suggest a proper way?
>
> Permanent log transactions are only needed for transactions that
> commit multiple times between reservations. You are doing:
>
> tp = alloc()
> reserve(tp, XFS_WRITE_LOG_RES, XFS_TRANS_PERM_LOG_RES,
> XFS_WRITE_LOG_COUNT)
> commit(tp, XFS_TRANS_RELEASE_LOG_RES)
>
> It's a single transaction. Permanent log transactions are used for
> multi-stage, rolling transactions that can be broken up into
> multiple atomic operations, so as freeing multiple extents from a
> file:
>
> tp = alloc()
> reserve(tp, XFS_WRITE_LOG_RES, XFS_TRANS_PERM_LOG_RES,
> XFS_WRITE_LOG_COUNT)
> .....
> tp2 = dup(tp)
> commit(tp)
> reserve(tp2, XFS_WRITE_LOG_RES, XFS_TRANS_PERM_LOG_RES,
> XFS_WRITE_LOG_COUNT)
> ....
> commit(tp2, XFS_TRANS_PERM_LOG_RES).
>
>
> The dup/reserve/commit loop keeps the same transaction context
> across the whole operation and allows them to make continual forward
> progress.
>
> Hmmmm. In looking at this, I notice the update case that uses a
> btree cursor doesn't have an the flist/firstblock initialised.
> That'll cause an oops if a block is ever allocated or freed in a
> record update. That would also indicate that the above does indeed
> need a permanent log transaction and probably needs to be structured
> similar to xfs_itruncate_extents with
> xfs_bmap_init/xfs_bmap_finish() and a rolling transaction just in
> case we end up modifying the btree.
Ok, we noted all your points and understand that a lot of work is
really needed to make it stable. we will try to implement them in next
version of patch. Really thanks for your time and help.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Greg KH: "Linux 3.0.89"
Previous message: Michal Hocko: "Re: [PATCH 02/23] memcg, thp: charge huge cache pages"
In reply to: Dave Chinner: "Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE"
Next in thread: Namjae Jeon: "Re: [PATCH 2/3] xfs: Implement FALLOC_FL_COLLAPSE_RANGE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]