Re: [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range

From: Gregory Farnum
Date: Tue Jan 28 2020 - 12:16:03 EST


On Mon, Jan 27, 2020 at 7:52 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> On Mon, Jan 27, 2020 at 07:16:17PM +0100, Ilya Dryomov wrote:
> > On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > As discussed here[1] I'm sending an RFC patchset that does the
> > > parallelization of the requests sent to the OSDs during a copy_file_range
> > > syscall in CephFS.
> > >
> > > [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@xxxxxxxx/
> > >
> > > I also have some performance numbers that I wanted to share. Here's a
> > > description of the very simple tests I've run:
> > >
> > > - create a file with 200 objects in it
> > >   * i.e. tests with different object sizes mean different file sizes
> > > - drop all caches and umount the filesystem
> > > - Measure:
> > >   * mount filesystem
> > >   * full file copy (with copy_file_range)
> > >   * umount filesystem
> > >
> > > Tests were repeated several times and the average value was used for
> > > comparison.
> > >
> > > DISCLAIMER:
> > > These numbers are only indicative, and different clusters and client
> > > configs will for sure show different performance! More rigorous tests
> > > would be required to validate these results.
> > >
> > > Having as baseline a full read+write (basically, a copy_file_range
> > > operation within a filesystem mounted without the 'copyfrom' option),
> > > here are some values for different object sizes:
> > >
> > >                           8M      4M      1M      65k
> > > read+write                100%    100%    100%    100%
> > > sequential                51%     52%     83%     >100%
> > > parallel (throttle=1)     51%     52%     83%     >100%
> > > parallel (throttle=0)     17%     17%     83%     >100%
> > >
> > > Notes:
> > >
> > > - 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
> > > requests to copy the 200 objects in the file) were sent to the OSDs and
> > > the wait for requests completion is done at the end only.
> > >
> > > - 'parallel (throttle=1)' was just a control test, where the wait for
> > > completion is done immediately after a request is sent. It was expected
> > > to be very similar to the non-optimized ('sequential') tests.
> > >
> > > - These tests were executed on a cluster with 40 OSDs, spread across 5
> > > (bare-metal) nodes.
> > >
> > > - The tests with object size of 65k show that copy_file_range definitely
> > > doesn't scale to files with small object sizes. '> 100%' actually means
> > > more than 10x slower.
> > >
> > > Measuring the mount+copy+umount masks the actual difference between
> > > different throttle values due to the time spent in mount+umount. Thus,
> > > there was no real difference between throttle=0 (send all and wait) and
> > > throttle=20 (send 20, wait, send 20, ...). But here's what I observed
> > > when measuring only the copy operation (4M object size):
> > >
> > > read+write                100%
> > > parallel (throttle=1)     56%
> > > parallel (throttle=5)     23%
> > > parallel (throttle=10)    14%
> > > parallel (throttle=20)    9%
> > > parallel (throttle=5)     5%
> >
> > Was this supposed to be throttle=50?
>
> Oops, no, it should be throttle=0 (i.e. no throttle).
>
> > >
> > > Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
> > > suggestion made by Jeff to *not* add another knob to fine-tune the
> > > throttle value -- this patch adds a kernel parameter for a knob that I
> > > wanted to use in my testing to observe different values of this throttle
> > > limit.
> > >
> > > The goal is probably to drop this patch and do the throttling in patch
> > > 0002. I just need to come up with a decent heuristic. Jeff's suggestion
> > > was to use rsize/wsize, which are set to 64M by default IIRC. Somehow I
> > > feel that it should be related to the number of OSDs in the cluster
> > > instead, but I'm not sure how. And testing this sort of heuristic would
> > > require different clusters, which aren't particularly easy to get. Anyway,
> > > comments are welcome!
> >
> > I agree with Jeff, this throttle is certainly not worth a module
> > parameter (or a mount option). I would start with something like
> > C * (wsize / object size) and pick C between 1 and 4.
>
> Sure, I also agree with not adding the new parameter or mount option.
> It's just tricky to pick (and test!) the best formula to use. From your
> proposal the throttle value would be by default between 16 and 64; those
> probably work fine in some situations (for example, in the cluster I used
> for running my tests). But for a really big cluster, with hundreds of
> OSDs, it's difficult to say.

We don't really need a single client to be capable of spraying the
entire cluster in a single operation -- as the wsize is already an
effective control over how parallel a single write is allowed to be, I
think we're okay using it as the basis for copy_file_range as well
without worrying about scaling it up.
-Greg
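
For illustration only (this is not part of the patchset): a minimal C
sketch of the heuristic discussed above, C * (wsize / object size), and of
how a throttle value would split the copy-from requests into batches. The
function names copy_from_throttle() and copy_from_batches() are
hypothetical, and the 64M wsize is the default Luis recalls in the thread
rather than a verified constant.

```c
#include <limits.h>

/*
 * Hypothetical helper (not from the patchset): derive the number of
 * in-flight 'copy-from' requests as C * (wsize / object_size).
 * Always returns at least 1 so that the copy makes progress.
 */
static unsigned int copy_from_throttle(unsigned long long wsize,
				       unsigned long long object_size,
				       unsigned int c)
{
	unsigned long long ratio, n;

	if (object_size == 0)
		return 1;	/* avoid dividing by zero */

	ratio = wsize / object_size;
	if (ratio > UINT_MAX)
		ratio = UINT_MAX;	/* keeps ratio * c from overflowing */

	n = ratio * c;
	if (n == 0)
		return 1;	/* object larger than wsize, or c == 0 */
	if (n > UINT_MAX)
		return UINT_MAX;
	return (unsigned int)n;
}

/*
 * Hypothetical helper: number of "send a batch, then wait" rounds needed
 * to copy 'objects' objects. As in the tests described in the thread,
 * throttle == 0 means "send everything, wait once at the end".
 */
static unsigned int copy_from_batches(unsigned int objects,
				      unsigned int throttle)
{
	if (objects == 0)
		return 0;
	if (throttle == 0)
		return 1;
	/* round up so a final partial batch is counted */
	return objects / throttle + (objects % throttle != 0);
}
```

With wsize = 64M and 4M objects, C = 1 gives 16 and C = 4 gives 64, which
matches the "between 16 and 64" range mentioned above. For the 200-object
test file, throttle=20 would mean 10 send/wait rounds, throttle=1 would
mean 200, and throttle=0 a single round.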

>
> Anyway, I'll come up with a proposal for the next revision. And thanks a
> lot for your feedback.
>
> Cheers,
> --
> Luís
>