Re: O_DIRECT to md raid 6 is slow

From: Andy Lutomirski
Date: Wed Aug 15 2012 - 18:11:01 EST


On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>> <john.robinson@xxxxxxxxxxxxxxxx> wrote:
>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>
>>>> If I do:
>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>
>>> [...]
>>>
>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>> I'm in O_DIRECT mode.
>>>
>>>
>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>
>> Crud.
>>
>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>> 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [6/6] [UUUUUU]
>>
>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>> (i.e. 1MB) boundary.
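
For reference, the alignment arithmetic above can be checked with a rough sketch like the one below. It assumes 512-byte logical sectors and uses the geometry from the mdstat output (6 disks, 2 of them parity, 512k chunk); the function names are made up for illustration and do not correspond to any md or gdisk interface.

```python
# Rough check of whether a partition start is aligned to a full data stripe.
# Assumption: 512-byte logical sectors. Helper names are hypothetical.
SECTOR_BYTES = 512
MIB = 1024 * 1024


def data_stripe_bytes(chunk_bytes, total_disks, parity_disks):
    """Bytes of data in one full stripe: chunk size times non-parity disks."""
    return chunk_bytes * (total_disks - parity_disks)


def is_stripe_aligned(start_sector, stripe_bytes, sector_bytes=SECTOR_BYTES):
    """True if the partition's byte offset is a multiple of the data stripe."""
    return (start_sector * sector_bytes) % stripe_bytes == 0


# Geometry from the mdstat output: RAID6, 6 disks, 512k chunk.
stripe = data_stripe_bytes(512 * 1024, total_disks=6, parity_disks=2)

print(stripe // MIB)                    # 2 (MiB per full data stripe)
print(is_stripe_aligned(2048, stripe))  # False: sector 2048 is only 1 MiB in
print(is_stripe_aligned(4096, stripe))  # True: sector 4096 is 2 MiB in
```

With this geometry a start sector of 4096 (or any multiple of it) would land on a stripe boundary, whereas the 2048-sector start leaves every 2MB write straddling two stripes.
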
>
> It's time to blow away the array and start over. You're already
> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
> except for a handful of niche all-streaming workloads with little or no
> rewrite, such as video surveillance or DVR.
>
> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
> Deleting a single file changes only a few bytes of directory metadata.
> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
> modify the directory block in question, calculate parity, then write out
> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
> a few bytes of metadata. Yes, insane.
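
The numbers quoted above follow from a simple model, sketched here. It assumes, as the quoted description does, that a small update reads one chunk from every member disk and then writes one chunk back to every member disk; that is an assumption taken from the quoted text, not a statement about what the md driver actually does for a given write. The function name is made up for illustration.

```python
# Worst-case read-modify-write traffic under the model in the quoted text:
# one chunk read from every member disk, then one chunk written to each.
# Assumption: this mirrors the quoted description, not md's actual code path.
MIB = 1024 * 1024


def full_stripe_rmw_bytes(chunk_bytes, total_disks):
    """Total bytes moved (read plus write) for one full-stripe RMW cycle."""
    stripe_on_disk = chunk_bytes * total_disks  # data and parity chunks
    return 2 * stripe_on_disk                   # read it all, write it all


print(full_stripe_rmw_bytes(512 * 1024, 6) // MIB)   # 6  (3 MiB read + 3 MiB written)
print(full_stripe_rmw_bytes(512 * 1024, 12) // MIB)  # 12 (6 MiB read + 6 MiB written)
```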

Grr. I thought the bad old days of filesystem and related defaults
sucking were over. cryptsetup aligns sanely these days, xfs is
sensible, etc. wtf? <rant>Why is there no sensible filesystem for
huge disks? zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang. There's still a bug here...

--Andy