Re: [PATCH] xfs: introduce object readahead to log recovery

From: Zhi Yong Wu
Date: Sun Jul 28 2013 - 21:38:37 EST


On Fri, Jul 26, 2013 at 7:35 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
>> Dave,
>>
>> All comments look good to me and will be applied to the next version, thanks a lot.
>>
>> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@xxxxxxxxx wrote:
>> >> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>> >>
>> >> Log recovery can take a long time because it is single threaded
>> >> and bound by read latency. Most of that time is spent waiting for
>> >> read IO to complete, so introducing object readahead to log
>> >> recovery should reduce the log recovery time.
>> >>
>> >> In dirty log case as below:
>> >> data device: 0xfd10
>> >> log device: 0xfd10 daddr: 20480032 length: 20480
>> >>
>> >> log tail: 7941 head: 11077 state: <DIRTY>
>> >
>> > That's only a small log (10MB). As I've said on irc, readahead won't
>> Yeah, it is a 10MB log, but how do you calculate that from the above info?
>
> length = 20480 blocks. 20480 * 512 = 10MB....
Thanks.
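
For reference, the size arithmetic above can be written as a small helper. This is only a sketch: it assumes the length field is in 512-byte basic blocks, as in the reply above, and the names BASIC_BLOCK_SIZE and log_blocks_to_bytes are made up for illustration.

```c
#include <stdint.h>

/* Assumption: log device length is reported in 512-byte basic blocks. */
#define BASIC_BLOCK_SIZE 512ULL

/* Convert a log length in basic blocks to a size in bytes. */
static uint64_t log_blocks_to_bytes(uint64_t blocks)
{
	return blocks * BASIC_BLOCK_SIZE;
}
```

With the two logs in this thread, 20480 blocks gives 10485760 bytes (10MB) and 4173824 blocks gives 2136997888 bytes (2038MB, almost 2GB).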
>
>> > And the recovery time from this is between 15-17s:
>> >
>> > ....
>> > log device: 0xfd20 daddr: 107374182032 length: 4173824
>> > ^^^^^^^ almost 2GB
>> > log tail: 19288 head: 264809 state: <DIRTY>
>> > ....
>> > real 0m17.913s
>> > user 0m0.000s
>> > sys 0m2.381s
>> >
>> > And runs at 3000-4000 read IOPS for most of that time. It's largely IO
>> > bound, even on SSDs.
>> >
>> > With your patch:
>> >
>> > log tail: 35871 head: 308393 state: <DIRTY>
>> > real 0m12.715s
>> > user 0m0.000s
>> > sys 0m2.247s
>> >
>> > And it peaked at ~5000 read IOPS.
>> How do you know the read IOPS is ~5000?
>
> Other monitoring. iostat can tell you this, though I use PCP...
Thanks.
>
>> > Ok, so you've based the readahead on the transaction item list
>> > having a next pointer. What I think you should do is turn this into
>> > a readahead queue by moving objects to a new list. i.e.
>> >
>> > 	list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
>> >
>> > 		case XLOG_RECOVER_PASS2:
>> > 			if (ra_qdepth++ >= MAX_QDEPTH) {
>> > 				recover_items(log, trans, &buffer_list, &ra_item_list);
>> > 				ra_qdepth = 0;
>> > 			} else {
>> > 				xlog_recover_item_readahead(log, item);
>> > 				list_move_tail(&item->ri_list, &ra_item_list);
>> > 			}
>> > 			break;
>> > 		...
>> > 		}
>> > 	}
>> > 	if (!list_empty(&ra_item_list))
>> > 		recover_items(log, trans, &buffer_list, &ra_item_list);
>> >
>> > I'd suggest that a queue depth somewhere between 10 and 100 will
>> > be necessary to keep enough IO in flight to keep the pipeline full
>> > and prevent recovery from having to wait on IO...
>> Good suggestion, will apply it to the next version, thanks.
>
> FWIW, I hacked a quick test of this into your patch here and a depth
> of 100 brought the recovery time down to under 8s. For other
> workloads which have nothing but dirty inodes (like fsmark) a depth
> of 100 drops the recovery time from ~100s to ~25s, and the iop rate
> is peaking at well over 15,000 IOPS. So we definitely want to queue
> up more than a single readahead...
Exciting, I will try it.
By the way, how do you test a workload that has nothing but dirty
dquot objects?
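
To make sure I understand the readahead queue discussed above, here is a minimal user-space sketch of the batching logic. It is not the kernel code: issue_readahead and replay_batch are hypothetical stand-ins for xlog_recover_item_readahead() and recover_items(), items are plain ints, and MAX_QDEPTH is set to 4 only to keep the example small (the thread suggests something between 10 and 100). One deliberate difference from the quoted pseudocode: every item is queued before the depth check, so the item that fills the queue is not skipped.

```c
#include <stddef.h>

#define MAX_QDEPTH 4	/* hypothetical; the thread suggests 10 to 100 */

struct ra_stats {
	size_t readaheads;	/* readahead requests issued */
	size_t replayed;	/* items replayed so far */
	size_t batches;		/* non-empty batches replayed */
	size_t max_inflight;	/* deepest the readahead queue got */
};

/* Stand-in for xlog_recover_item_readahead(): only counts the request. */
static void issue_readahead(int item, struct ra_stats *st)
{
	(void)item;
	st->readaheads++;
}

/* Stand-in for recover_items(): replay queued items in order, then empty the queue. */
static void replay_batch(const int *queue, size_t *depth, int *out,
			 struct ra_stats *st)
{
	if (*depth == 0)
		return;
	for (size_t i = 0; i < *depth; i++)
		out[st->replayed++] = queue[i];
	st->batches++;
	*depth = 0;
}

/* Issue readahead for up to MAX_QDEPTH items before replaying them. */
static struct ra_stats recover_pass2(const int *items, size_t n, int *out)
{
	struct ra_stats st = {0};
	int queue[MAX_QDEPTH];
	size_t depth = 0;

	for (size_t i = 0; i < n; i++) {
		issue_readahead(items[i], &st);
		queue[depth++] = items[i];
		if (depth > st.max_inflight)
			st.max_inflight = depth;
		if (depth == MAX_QDEPTH)
			replay_batch(queue, &depth, out, &st);
	}
	/* Replay whatever is left in a final, partial batch. */
	replay_batch(queue, &depth, out, &st);
	return st;
}
```

Calling recover_pass2() on 10 items with this depth replays them in their original order in three batches (4, 4 and 2 items), with at most 4 readaheads outstanding at a time.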

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx



--
Regards,

Zhi Yong Wu