Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs

From: Corrado Zoccolo
Date: Thu Mar 04 2010 - 15:35:00 EST


Hi Vivek,
On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
>> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
>> >> CFQ currently applies the same logic for detecting seeky queues and
>> >> grouping them together to rotational disks and SSDs alike.
>> >> For SSDs, the time to complete a request doesn't depend on the
>> >> request location, but only on the request size.
>> >> This patch therefore changes the criterion, grouping queues by
>> >> request size in the case of SSDs, in order to achieve better fairness.
>> >
>> > Hi Corrado,
>> >
>> > Can you give some numbers regarding how are you measuring fairness and
>> > how did you decide that we achieve better fairness?
>> >
>> Please see the attached fio script. It benchmarks pairs of processes
>> performing direct random I/O.
>> One is always fixed at bs=4k, while I vary the other from 8k to 64k:
>> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
>> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
>> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
>> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
>> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
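>>
>> For reference, the job file looks roughly like this (a reconstructed
>> sketch of the attachment; the target device and exact runtime are
>> assumptions, not from the attachment itself):
>>
>> [global]
>> rw=randread
>> direct=1
>> ioengine=sync
>> iodepth=1
>> runtime=5
>> time_based
>> filename=/dev/sdb  ; assumption: the flash card under test
>>
>> [test00]
>> bs=8k
>> [test01]
>> bs=4k
>> [test10]
>> stonewall  ; starts a new reporting group
>> bs=16k
>> [test11]
>> bs=4k
>> [test20]
>> stonewall
>> bs=32k
>> [test21]
>> bs=4k
>> [test30]
>> stonewall
>> bs=64k
>> [test31]
>> bs=4k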
>>
>> With unpatched cfq (2.6.33), on a flash card (non-NCQ), after running
>> a fio script with a high number of parallel readers to make sure NCQ
>> detection has stabilized, I get the following:
>> Run status group 0 (all jobs):
>>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
>> mint=5001msec, maxt=5003msec
>>
>> Run status group 1 (all jobs):
>>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 2 (all jobs):
>>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
>> mint=5001msec, maxt=5004msec
>>
>> Run status group 3 (all jobs):
>>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
>> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
>>
>> As you can see from minb, the process with the smallest I/O size is
>> penalized: since both queues are marked as no-idle, they both end up
>> in the no-idle tree, where they are serviced round robin. So they get
>> fairness in terms of IOPS, but bandwidth varies a lot (at equal IOPS,
>> a process issuing 64k requests gets 16 times the bandwidth of one
>> issuing 4k requests).
>>
>> With my patches in place, I get:
>> Run status group 0 (all jobs):
>>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 1 (all jobs):
>>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
>> mint=5001msec, maxt=5003msec
>>
>> Run status group 2 (all jobs):
>>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 3 (all jobs):
>>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
>> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
>>
>> The process doing smaller requests is now not penalized by the fact
>> that it runs concurrently with the other one, and the other one still
>> benefits from its larger requests because it makes better use of its
>> time slice.
>>
>
> Hi Corrado,
>
> I ran your fio script on my SSD, which supports NCQ. In my case, both
> processes lose.
>
> Vanilla kernel (2.6.33)
> =======================
> Run status group 0 (all jobs):
>   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 1 (all jobs):
>   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 2 (all jobs):
>   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 3 (all jobs):
>   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s,
> mint=5001msec, maxt=5001msec
>
> With two seek patches applied
> =============================
> Run status group 0 (all jobs):
>   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 1 (all jobs):
>   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 2 (all jobs):
>   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 3 (all jobs):
>   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
> mint=5001msec, maxt=5001msec
>
> Note how overall BW suffers for groups 2 and 3. This needs some
> debugging to figure out why it is happening. I guess we are idling on
> the sync-noidle tree and not idling on the sync-idle tree; probably
> that's why the bigger-request-size process can't get enough IO pushed.
> But then the smaller-request-size process should have got more IO done,
> but that also does not seem to be happening.

I think this is a preexisting bug. You can probably trigger it in
vanilla 2.6.33 by having one process doing random reads and another
doing sequential reads.
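For example, a fio job along these lines should show it (an untested
sketch; the device path and runtime are assumptions, not from any
posted script):

[global]
direct=1
ioengine=sync
iodepth=1
runtime=10
time_based
filename=/dev/sdb  ; assumption: an NCQ-capable SSD

[sequential]
rw=read
bs=4k

[random]
rw=randread
bs=4k
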
I think the issue is:
* cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
	enum wl_prio_t prio = cfqq_prio(cfqq);
	struct cfq_rb_root *service_tree = cfqq->service_tree;

	BUG_ON(!service_tree);
	BUG_ON(!service_tree->count);

	/* We never do for idle class queues. */
	if (prio == IDLE_WORKLOAD)
		return false;

	/* We do for queues that were marked with idle window flag. */
	if (cfq_cfqq_idle_window(cfqq) &&
	   !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
		return true;

	/*
	 * Otherwise, we do only if they are the last ones
	 * in their service tree.
	 */
	return service_tree->count == 1 && cfq_cfqq_sync(cfqq);
}

* even when we never activate the idle timer, we still wait for request
completion by keeping active a queue that has requests in flight,
whenever cfq_should_idle returns true. This happens in two places:
	/*
	 * The active queue has run out of time, expire it and select new.
	 */
	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq)) {
		/*
		 * If slice had not expired at the completion of last request
		 * we might not have turned on wait_busy flag. Don't expire
		 * the queue yet. Allow the group to get backlogged.
		 *
		 * The very fact that we have used the slice, that means we
		 * have been idling all along on this queue and it should be
		 * ok to wait for this request to complete.
		 */
		if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
		    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
			cfqq = NULL;
			goto keep_queue;
		} else
			goto expire;
	}

	/*
	 * No requests pending. If the active queue still has requests in
	 * flight or is idling for a new request, allow either of these
	 * conditions to happen (or time out) before selecting a new queue.
	 */
	if (timer_pending(&cfqd->idle_slice_timer) ||
	    (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
		cfqq = NULL;
		goto keep_queue;
	}

Can you change cfq_should_idle to always return false for NCQ SSDs and retest?
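
I.e., something like this (an untested sketch, just to check the
theory; note that the early return would also make the nonrot/hw_tag
test in the idle window check redundant, so I dropped it):

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
	enum wl_prio_t prio = cfqq_prio(cfqq);
	struct cfq_rb_root *service_tree = cfqq->service_tree;

	BUG_ON(!service_tree);
	BUG_ON(!service_tree->count);

	/* Experiment: never idle, in any form, on NCQ SSDs. */
	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
		return false;

	/* We never do for idle class queues. */
	if (prio == IDLE_WORKLOAD)
		return false;

	/* We do for queues that were marked with idle window flag. */
	if (cfq_cfqq_idle_window(cfqq))
		return true;

	/*
	 * Otherwise, we do only if they are the last ones
	 * in their service tree.
	 */
	return service_tree->count == 1 && cfq_cfqq_sync(cfqq);
}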

Thanks
Corrado

>
> Both small request size and big request size processes are losing.
>
> Thanks
> Vivek
>
>> > In the case of SSDs with NCQ, we will not idle on any of the queues
>> > (either sync or sync-noidle (seeky queues)). So w.r.t. the code, what
>> > behavior changes if we mark a queue as seeky/non-seeky on an SSD?
>> >
>>
>> I've not tested on an NCQ SSD, but I think at worst it will not harm,
>> and at best it will provide similar fairness improvements when the
>> number of processes submitting requests grows above the number of
>> available NCQ slots.
>>
>> > IOW, looking at this patch, any queue doing IO in chunks smaller than
>> > 32K on an SSD will now be marked as seeky. How does that change the
>> > behavior in terms of fairness for the queue?
>> >
>> Basically, we will have IOPS-based fairness for small requests, and
>> time-based fairness for larger requests.
>>
>> Thanks,
>> Corrado
>>
>> > Thanks
>> > Vivek
>> >
>> >>
>> >> Signed-off-by: Corrado Zoccolo <czoccolo@xxxxxxxxx>
>> >> ---
>> >>  block/cfq-iosched.c |    7 ++++++-
>> >>  1 files changed, 6 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> >> index 806d30b..f27e535 100644
>> >> --- a/block/cfq-iosched.c
>> >> +++ b/block/cfq-iosched.c
>> >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
>> >>  #define CFQ_SERVICE_SHIFT       12
>> >>
>> >>  #define CFQQ_SEEK_THR		(sector_t)(8 * 100)
>> >> +#define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
>> >>  #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
>> >>
>> >>  #define RQ_CIC(rq)		\
>> >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> >>  		       struct request *rq)
>> >>  {
>> >>  	sector_t sdist = 0;
>> >> +	sector_t n_sec = blk_rq_sectors(rq);
>> >>  	if (cfqq->last_request_pos) {
>> >>  		if (cfqq->last_request_pos < blk_rq_pos(rq))
>> >>  			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
>> >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> >>  	}
>> >>
>> >>  	cfqq->seek_history <<= 1;
>> >> -	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>> >> +	if (blk_queue_nonrot(cfqd->queue))
>> >> +		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
>> >> +	else
>> >> +		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>> >>  }
>> >>
>> >>  /*
>> >> --
>> >> 1.6.4.4
>> >
>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda