Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34

From: Corrado Zoccolo
Date: Wed Mar 03 2010 - 17:39:22 EST


On Tue, Mar 2, 2010 at 12:01 AM, Corrado Zoccolo <czoccolo@xxxxxxxxx> wrote:
> Hi Vivek,
> On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
>> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>>>
>>> Hi, I'm resending the seeky-detection rework patch, together with
>>> the companion patch for SSDs, in order to get some testing on more
>>> hardware.
>>>
>>> The first patch in the series fixes a regression introduced in 2.6.33
>>> for random mmap reads of more than one page, when multiple processes
>>> are competing for the disk.
>>> There is at least one HW RAID controller where it reduces performance,
>>> though (this controller generally performs worse with CFQ than
>>> with NOOP, probably because it performs non-work-conserving
>>> I/O scheduling internally), so more testing on RAIDs is appreciated.
>>>
>>
>> Hi Corrado,
>>
>> This time I don't have the machine where I had previously reported
>> regressions, but somebody has exported two LUNs to me from a storage box
>> over a SAN, and I have done my testing on that. With this seek patch applied,
>> I still see the regressions.
>>
>> iosched=cfq   Filesz=1G  bs=64K
>>
>>                         2.6.33                  2.6.33-seek
>> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)   %Rd  %Wr
>> --------  --- --  ----------  ----------  ----------  ----------  ---- ----
>> brrmmap    3   1        7113           0        7044           0    0%   0%
>> brrmmap    3   2        6977           0        6774           0   -2%   0%
>> brrmmap    3   4        7410           0        6181           0  -16%   0%
>> brrmmap    3   8        9405           0        6020           0  -35%   0%
>> brrmmap    3  16       11445           0        5792           0  -49%   0%
>>
>>                         2.6.33                  2.6.33-seek
>> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)   %Rd  %Wr
>> --------  --- --  ----------  ----------  ----------  ----------  ---- ----
>> drrmmap    3   1        7195           0        7337           0    1%   0%
>> drrmmap    3   2        7016           0        6855           0   -2%   0%
>> drrmmap    3   4        7438           0        6103           0  -17%   0%
>> drrmmap    3   8        9298           0        6020           0  -35%   0%
>> drrmmap    3  16       11576           0        5827           0  -49%   0%
>>
>>
>> I have run buffered random reads on mmapped files (brrmmap) and direct
>> random reads on mmapped files (drrmmap) using fio. I have run these for
>> an increasing number of threads, repeated each run 3 times, and took the
>> average of the three sets for reporting.

BTW, I think O_DIRECT doesn't affect mmap operation.

>>
>> I have used a file size of 1G and bs=64K, and ran each test sample for 30
>> seconds.
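
For reference, here is a rough userspace sketch of the access pattern
being measured (an assumed illustration written by hand, not the actual
fio job; the file name and iteration count are made up): it faults in
random 64KB-aligned chunks of a 1G mmapped file, so each chunk is a short
sequential run of page faults followed by a jump to a random offset.

/*
 * Assumed illustration of the brrmmap-style workload, not the actual
 * fio job: touch random 64KB-aligned chunks of an mmapped 1GB file.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t fsz = 1UL << 30;           /* 1GB test file */
	const size_t bs  = 64 * 1024;           /* 64KB chunk, as in bs=64K */
	int fd = open("testfile", O_RDONLY);
	char *map = mmap(NULL, fsz, PROT_READ, MAP_PRIVATE, fd, 0);
	volatile char sink;

	if (fd < 0 || map == MAP_FAILED)
		return 1;

	for (int i = 0; i < 100000; i++) {
		size_t off = ((size_t)rand() % (fsz / bs)) * bs;
		for (size_t pg = 0; pg < bs; pg += 4096)  /* page-sized faults */
			sink = map[off + pg];
	}
	munmap(map, fsz);
	close(fd);
	return 0;
}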
>>
>> Because with the new seek logic we will mark the above type of cfqq as
>> non-seeky and will idle on it, I take a significant hit in performance on
>> storage boxes which have more than one spindle.
Thinking about this, can you check whether your disks report a non-zero
/sys/block/sda/queue/optimal_io_size?
From the comment in blk-settings.c, I see this should be non-zero for
RAIDs, so it may help discriminate the cases we want to optimize
for.
It could also help in identifying the correct threshold.
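
Something along these lines could tie the threshold to that value (just
a rough sketch under my assumptions, not a tested patch; the helper name
and the default constant are illustrative):

/*
 * Rough sketch only: derive the "non-seeky" run threshold from the
 * device's reported optimal I/O size (typically a full stripe on RAID)
 * instead of a hard-coded 64KB, falling back to the default when the
 * device does not export one (optimal_io_size is 0 on most single disks).
 */
#include <linux/blkdev.h>

#define CFQQ_NONSEEKY_THR_DEFAULT	(64 * 1024)	/* bytes */

static inline unsigned int cfq_nonseeky_thr(struct request_queue *q)
{
	unsigned int io_opt = queue_io_opt(q);

	if (!io_opt)
		return CFQQ_NONSEEKY_THR_DEFAULT;

	/* Idle only for queues whose sequential runs cover a full stripe. */
	return max_t(unsigned int, io_opt, CFQQ_NONSEEKY_THR_DEFAULT);
}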
>
> Thanks for testing on a different setup.
> I wonder if the wrong part for multi-spindle is the 64kb threshold.
> Can you run with larger bs, and see if there is a value for which
> idling is better?
> For example, on a 2-disk RAID 0 I would expect that a bs larger than
> the stripe would still benefit from idling.
>
>>
>> So basically, the regression is not only on that particular RAID card but
>> on other kinds of devices which can drive more than one spindle.
Ok, makes sense. If the number of sequential pages read before jumping
to a random address is smaller than the RAID stripe, we are wasting
potential parallelism (e.g. a 64KB run on a RAID 0 with 256KB chunks
lands entirely on one member disk, so idling on that queue leaves the
other spindles unused).
>>
>> I will run some test on single SATA disk also where this patch should
>> benefit.
>>
>> Based on testing results so far, I am not a big fan of marking these mmap
>> queues as sync-idle. I guess if this patch really benefits, then we first
>> need to put in place some logic to detect whether the device is a
>> single-spindle SATA disk, and only on such disks mark mmap queues as sync.
>>
>> Apart from synthetic workloads, where is this patch helping you in practice?
>
> The synthetic workload mimics the page fault patterns that can be seen
> on program startup, and that is the target of my optimization. In
> 2.6.32, we went in the direction of enabling idling also for seeky
> queues, while 2.6.33 tried to be friendlier to parallel storage
> by usually allowing more parallel requests. Unfortunately, this
> impacted this peculiar access pattern, so we need to fix it somehow.
>
> Thanks,
> Corrado
>
>>
>> Thanks
>> Vivek
>>
>>
>>> The second patch changes the seeky detection logic to be meaningful
>>> also for SSDs. A seeky request is one that doesn't utilize the full
>>> bandwidth of the device. For SSDs, this happens for small requests,
>>> regardless of their location.
>>> With this change, the grouping of "seeky" requests done by CFQ can
>>> result in a fairer distribution of disk service time among processes.
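
As an illustration of that idea (a sketch with assumed names and
thresholds, not the code in the posted patch):

/*
 * Sketch of the idea only: a request is "seeky" if it cannot keep the
 * device busy by itself -- a long jump on rotational disks, a small
 * request on non-rotational (SSD) devices regardless of its position.
 * The 16KB and 4MB thresholds below are assumptions for illustration.
 */
#include <linux/blkdev.h>

static bool cfq_rq_seeky(struct request_queue *q, sector_t last_pos,
			 struct request *rq)
{
	const sector_t small_rq_sectors = 32;		/* 16KB */
	const sector_t big_jump_sectors = 8 * 1024;	/* 4MB  */
	sector_t dist;

	if (blk_queue_nonrot(q))
		return blk_rq_sectors(rq) < small_rq_sectors;

	if (!last_pos)
		return false;

	dist = blk_rq_pos(rq) > last_pos ? blk_rq_pos(rq) - last_pos
					 : last_pos - blk_rq_pos(rq);
	return dist > big_jump_sectors;
}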
>>
>