Re: [PATCH] ceph: re-org copy_file_range and fix error handling paths

From: Jeff Layton
Date: Thu Feb 20 2020 - 17:54:18 EST


On Thu, 2020-02-20 at 22:36 +0000, Luis Henriques wrote:
> On Thu, Feb 20, 2020 at 03:41:14PM -0500, Jeff Layton wrote:
> > On Mon, 2020-02-17 at 12:36 +0000, Luis Henriques wrote:
> > > This patch re-organizes copy_file_range, trying to fix a few issues in
> > > error handling. Here's the summary:
> > >
> > > - Abort copy if initial do_splice_direct() returns fewer bytes than
> > > requested.
> > >
> > > - Move the 'size' initialization (with i_size_read()) further down in the
> > > code, after the initial call to do_splice_direct(). This avoids issues
> > > with a possibly stale value if a manual copy is done.
> > >
> > > - Move the object copy loop into a separate function. This makes it
> > > > easier to handle errors (e.g., dirtying caps and updating the MDS
> > > metadata if only some objects have been copied before an error has
> > > occurred).
> > >
> > > > - Add calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc
> > > > and dst_oloc.
> > >
> > > > - After the object copy loop, the new file size to be reported to the MDS
> > > > (if there's a file size change) is now the actual file size, and not the
> > > > size after a possible extra manual copy.
> > >
> > > - Added a few dout() to show the number of bytes copied in the two manual
> > > copies and in the object copy loop.
> > >
> > > Signed-off-by: Luis Henriques <lhenriques@xxxxxxxx>
> > > ---
> > > Hi!
> > >
> > > > Initially I was going to split this patch into a series, but then I
> > > > decided not to, as a single big patch makes it easier (IMO) to see the
> > > > whole picture. But please let me know if you think otherwise and I can
> > > > split it into a few smaller patches.
> > >
> > > I tried to cover all the issues that have been pointed out by Ilya, but I
> > > may have missed something or, more likely, introduced new bugs ;-)
> > >
> > > Cheers,
> > > --
> > > Luis
> > >
> >
> > Sorry for the delay in review!
>
> No worries, I appreciate the feedback but I obviously don't expect it to
> happen immediately :-)
>
> > > fs/ceph/file.c | 169 ++++++++++++++++++++++++++++---------------------
> > > 1 file changed, 96 insertions(+), 73 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index c3b8e8e0bf17..4d90a275f9a5 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1931,6 +1931,71 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
> > > return 0;
> > > }
> > >
> > > +static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off,
> > > + struct ceph_inode_info *dst_ci, u64 *dst_off,
> > > + struct ceph_fs_client *fsc,
> > > + size_t len, unsigned int flags)
> > > +{
> > > + struct ceph_object_locator src_oloc, dst_oloc;
> > > + struct ceph_object_id src_oid, dst_oid;
> > > + size_t bytes = 0;
> > > + u64 src_objnum, src_objoff, dst_objnum, dst_objoff;
> > > + u32 src_objlen, dst_objlen;
> > > + u32 object_size = src_ci->i_layout.object_size;
> > > + int ret;
> > > +
> > > + src_oloc.pool = src_ci->i_layout.pool_id;
> > > + src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns);
> > > + dst_oloc.pool = dst_ci->i_layout.pool_id;
> > > + dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns);
> > > +
> > > + while (len >= object_size) {
> > > + ceph_calc_file_object_mapping(&src_ci->i_layout, *src_off,
> > > + object_size, &src_objnum,
> > > + &src_objoff, &src_objlen);
> > > + ceph_calc_file_object_mapping(&dst_ci->i_layout, *dst_off,
> > > + object_size, &dst_objnum,
> > > + &dst_objoff, &dst_objlen);
> > > + ceph_oid_init(&src_oid);
> > > + ceph_oid_printf(&src_oid, "%llx.%08llx",
> > > + src_ci->i_vino.ino, src_objnum);
> > > + ceph_oid_init(&dst_oid);
> > > + ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > > + dst_ci->i_vino.ino, dst_objnum);
> > > + /* Do an object remote copy */
> > > + ret = ceph_osdc_copy_from(&fsc->client->osdc,
> > > + src_ci->i_vino.snap, 0,
> > > + &src_oid, &src_oloc,
> > > + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
> > > + CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
> > > + &dst_oid, &dst_oloc,
> > > + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
> > > + CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > + dst_ci->i_truncate_seq,
> > > + dst_ci->i_truncate_size,
> > > + CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > + if (ret) {
> > > + if (ret == -EOPNOTSUPP) {
> > > + fsc->have_copy_from2 = false;
> > > + pr_notice("OSDs don't support copy-from2; disabling copy offload\n");
> > > + }
> > > + dout("ceph_osdc_copy_from returned %d\n", ret);
> > > + if (!bytes)
> > > + bytes = ret;
> > > + goto out;
> > > + }
> > > + len -= object_size;
> > > + bytes += object_size;
> > > + *src_off += object_size;
> > > + *dst_off += object_size;
> > > + }
> > > +
> > > +out:
> > > + ceph_oloc_destroy(&src_oloc);
> > > + ceph_oloc_destroy(&dst_oloc);
> > > + return bytes;
> > > +}
> > > +
> > > static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > struct file *dst_file, loff_t dst_off,
> > > size_t len, unsigned int flags)
> > > @@ -1941,14 +2006,11 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> > > struct ceph_cap_flush *prealloc_cf;
> > > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > > - struct ceph_object_locator src_oloc, dst_oloc;
> > > - struct ceph_object_id src_oid, dst_oid;
> > > - loff_t endoff = 0, size;
> > > - ssize_t ret = -EIO;
> > > + loff_t size;
> > > + ssize_t ret = -EIO, bytes;
> > > u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > > - u32 src_objlen, dst_objlen, object_size;
> > > + u32 src_objlen, dst_objlen;
> > > int src_got = 0, dst_got = 0, err, dirty;
> > > - bool do_final_copy = false;
> > >
> > > if (src_inode->i_sb != dst_inode->i_sb) {
> > > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > @@ -2026,22 +2088,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > if (ret < 0)
> > > goto out_caps;
> > >
> > > - size = i_size_read(dst_inode);
> > > - endoff = dst_off + len;
> > > -
> > > /* Drop dst file cached pages */
> > > ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
> > > dst_off >> PAGE_SHIFT,
> > > - endoff >> PAGE_SHIFT);
> > > + (dst_off + len) >> PAGE_SHIFT);
> > > if (ret < 0) {
> > > dout("Failed to invalidate inode pages (%zd)\n", ret);
> > > ret = 0; /* XXX */
> > > }
> > > - src_oloc.pool = src_ci->i_layout.pool_id;
> > > - src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns);
> > > - dst_oloc.pool = dst_ci->i_layout.pool_id;
> > > - dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns);
> > > -
> > > ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > > src_ci->i_layout.object_size,
> > > &src_objnum, &src_objoff, &src_objlen);
> > > @@ -2060,6 +2114,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > * starting at the src_off
> > > */
> > > if (src_objoff) {
> > > + dout("Initial partial copy of %u bytes\n", src_objlen);
> > > +
> > > /*
> > > * we need to temporarily drop all caps as we'll be calling
> > > * {read,write}_iter, which will get caps again.
> > > @@ -2067,8 +2123,9 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got);
> > > ret = do_splice_direct(src_file, &src_off, dst_file,
> > > &dst_off, src_objlen, flags);
> > > - if (ret < 0) {
> > > - dout("do_splice_direct returned %d\n", err);
> > > + /* Abort on short copies or on error */
> > > + if (ret < src_objlen) {
> > > + dout("Failed partial copy (%zd)\n", ret);
> > > goto out;
> > > }
> > > len -= ret;
> > > @@ -2081,62 +2138,29 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > if (err < 0)
> > > goto out_caps;
> > > }
> > > - object_size = src_ci->i_layout.object_size;
> > > - while (len >= object_size) {
> > > - ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > > - object_size, &src_objnum,
> > > - &src_objoff, &src_objlen);
> > > - ceph_calc_file_object_mapping(&dst_ci->i_layout, dst_off,
> > > - object_size, &dst_objnum,
> > > - &dst_objoff, &dst_objlen);
> > > - ceph_oid_init(&src_oid);
> > > - ceph_oid_printf(&src_oid, "%llx.%08llx",
> > > - src_ci->i_vino.ino, src_objnum);
> > > - ceph_oid_init(&dst_oid);
> > > - ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > > - dst_ci->i_vino.ino, dst_objnum);
> > > - /* Do an object remote copy */
> > > - err = ceph_osdc_copy_from(
> > > - &src_fsc->client->osdc,
> > > - src_ci->i_vino.snap, 0,
> > > - &src_oid, &src_oloc,
> > > - CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
> > > - CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
> > > - &dst_oid, &dst_oloc,
> > > - CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
> > > - CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > - dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > - CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > - if (err) {
> > > - if (err == -EOPNOTSUPP) {
> > > - src_fsc->have_copy_from2 = false;
> > > - pr_notice("OSDs don't support copy-from2; disabling copy offload\n");
> > > - }
> > > - dout("ceph_osdc_copy_from returned %d\n", err);
> > > - if (!ret)
> > > - ret = err;
> > > - goto out_caps;
> > > - }
> > > - len -= object_size;
> > > - src_off += object_size;
> > > - dst_off += object_size;
> > > - ret += object_size;
> > > - }
> > >
> > > - if (len)
> > > - /* We still need one final local copy */
> > > - do_final_copy = true;
> > > + size = i_size_read(dst_inode);
> > > + bytes = ceph_do_objects_copy(src_ci, &src_off, dst_ci, &dst_off,
> > > + src_fsc, len, flags);
> > > + if (bytes <= 0) {
> > > + if (!ret)
> > > + ret = bytes;
> > > + goto out_caps;
> > > + }
> >
> > Suppose we did the front part with do_splice_direct (ret > 0), but then
> > ceph_do_objects_copy fails (bytes < 0). We "goto out_caps" but then...
> >
> > [...]
> >
> > > @@ -2152,15 +2176,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > out_caps:
> > > put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got);
> > >
> > > - if (do_final_copy) {
> > > - err = do_splice_direct(src_file, &src_off, dst_file,
> > > - &dst_off, len, flags);
> > > - if (err < 0) {
> > > - dout("do_splice_direct returned %d\n", err);
> > > - goto out;
> > > - }
> > > - len -= err;
> > > - ret += err;
> > > + if (len && (ret >= 0)) {
> >
> > ...len is still positive and we do_splice_direct again (probably at the
> > wrong offset?), instead of just returning a short copy. I think we
> > probably want to just stop any copying if it fails at any point along
> > the way, right?
>
> Well... To be honest I deliberately wanted to try a do_splice_direct in
> case the remote object copies fail (basically, reverting to something
> similar to the generic_copy_file_range). Now, I've been staring at this
> code for some time and I may be missing the obvious (again!). But:
>
> - the offsets should be OK because ceph_do_objects_copy() only updates
> them after each successful object copy
>
> - len should also be consistent:
> * If 'bytes' was <= 0, it should contain what was written by
> do_splice_direct
> * if 'bytes' was > 0, but possibly < expected (e.g. an OSD returned an
> error after a few object copies), len should still be consistent
>
> Anyway, I'm not too attached to this approach, and if you'd rather have this
> function return in the scenario you've described (and have the user
> eventually retry the operation) I'm OK with that.
>

Yes, sorry. I wasn't suggesting that this would fall over; I was asking
whether it was the intended behavior.

I'm not sure we gain anything by doing a second splice once we've had a
failure though. I think if this isn't going to (largely) use copy
offloading then we probably ought to stop and just return to userland as
quickly as we can.
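
Roughly what I have in mind, as a userland sketch rather than actual kernel
code. The function name copy_stop_on_failure() and the want/got arrays are
made up for illustration: each entry stands in for one phase of the copy
(the initial splice, the object copy loop, the final splice), with got[i]
being what that phase reported back.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical model of a multi-phase copy. Phase i was asked to copy
 * want[i] bytes and reported got[i]: a byte count, or a negative errno.
 *
 * Policy: stop at the first error or short copy. Return the number of
 * bytes copied so far, or the error itself if nothing was copied at all.
 * Later phases are never attempted once one phase has failed.
 */
static ssize_t copy_stop_on_failure(const size_t *want, const ssize_t *got,
				    size_t nphases)
{
	ssize_t total = 0;
	size_t i;

	assert(want && got);

	for (i = 0; i < nphases; i++) {
		if (got[i] < 0)
			/* Error: report a short copy if we made progress */
			return total ? total : got[i];

		total += got[i];

		if ((size_t)got[i] < want[i])
			/* Short copy: don't try the remaining phases */
			break;
	}
	return total;
}
```

So if the first splice copies 100 bytes and the object copy then fails, the
caller just sees a short copy of 100 and can decide whether to retry.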


> > > + dout("Final partial copy of %zu bytes\n", len);
> > > + bytes = do_splice_direct(src_file, &src_off, dst_file,
> > > + &dst_off, len, flags);
> > > + if (bytes > 0)
> > > + ret += bytes;
> > > + else
> > > + dout("Failed partial copy (%zd)\n", bytes);
> > > }
> > >
> > > out:
> >
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >

--
Jeff Layton <jlayton@xxxxxxxxxx>