Re: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requestsfor non-btrfs filesystems

From: Josef Bacik
Date: Tue Nov 02 2010 - 10:57:33 EST


On Tue, Nov 02, 2010 at 01:18:42PM +0100, Christian Ehrhardt wrote:
> Hi,
> this is about an issue newer kernels show, bysplitting direct I/O requests
> into 4k pieces to directly merge them in the Block Device Layer afterwards.
>
> If anyone is interested in own tests just use a simple command like
> dd if=/mnt/test/test-dd1 of=/dev/null iflag=direct bs=64k count=1
> in combination with blktrace.
>
> The following patch is more a proposal for discussion than a solution, well
> thats what RFC's are about right.
> I'm unsure about names, but also if the approach in general is the right way.
> It should apply to every 2.6.36 and 2.6.37 kernel.
>
> I put everyone on CC who was involved in the patches leading to the current
> behavior.
>
> Grüsse / regards,
> Christian Ehrhardt
> IBM Linux Technology Center, System z Linux Performance
>
> --- cut here ---
>
> Subject: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems
>
> From: Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx>
>
> Commit c2c6ca41 by Josef Bacik <josef@xxxxxxxxxx> caused all direct I/O's to
> be split into 4k requests before arriving in the block device layer.
> This was later on partially fixed by Jeff Moyer <jmoyer@xxxxxxxxxx> in
> 7a801ac6.
>
> Jeffs fix improved the situation a lot, but eventually it still splits I/Os
> for non-btrfs file systems as well were it wouldn't have to.
>
> Eventually in my example on a ext2 filesystem it splits it every 4Mb where
> dio->boundary is evaluated to true.
>
> In blktrace this looks like:
> dd-910 [002] 38.762523: 94,8 A R 131264 + 8 <- (94,9) 131072
> dd-910 [002] 38.762531: 94,8 Q R 131264 + 8 [dd]
> dd-910 [002] 38.762535: 94,8 G R 131264 + 8 [dd]
> dd-910 [002] 38.762537: 94,8 P N [dd]
> dd-910 [002] 38.762539: 94,8 I R 131264 + 8 [dd]
> dd-910 [002] 38.762544: 94,8 A R 131272 + 8 <- (94,9) 131080
> dd-910 [002] 38.762544: 94,8 Q R 131272 + 8 [dd]
> dd-910 [002] 38.762546: 94,8 M R 131272 + 8 [dd]
> dd-910 [002] 38.762550: 94,8 A R 131280 + 8 <- (94,9) 131088
> dd-910 [002] 38.762551: 94,8 Q R 131280 + 8 [dd]
> dd-910 [002] 38.762551: 94,8 M R 131280 + 8 [dd]
> dd-910 [002] 38.762556: 94,8 A R 131288 + 8 <- (94,9) 131096
> dd-910 [002] 38.762557: 94,8 Q R 131288 + 8 [dd]
> dd-910 [002] 38.762557: 94,8 M R 131288 + 8 [dd]
> dd-910 [002] 38.762562: 94,8 A R 131296 + 8 <- (94,9) 131104
> dd-910 [002] 38.762563: 94,8 Q R 131296 + 8 [dd]
> dd-910 [002] 38.762564: 94,8 M R 131296 + 8 [dd]
> dd-910 [002] 38.762568: 94,8 A R 131304 + 8 <- (94,9) 131112
> dd-910 [002] 38.762569: 94,8 Q R 131304 + 8 [dd]
> dd-910 [002] 38.762570: 94,8 M R 131304 + 8 [dd]
> dd-910 [002] 38.762577: 94,8 A R 131312 + 8 <- (94,9) 131120
> dd-910 [002] 38.762578: 94,8 Q R 131312 + 8 [dd]
> dd-910 [002] 38.762579: 94,8 M R 131312 + 8 [dd]
> dd-910 [002] 38.762584: 94,8 A R 131320 + 8 <- (94,9) 131128
> dd-910 [002] 38.762584: 94,8 Q R 131320 + 8 [dd]
> dd-910 [002] 38.762585: 94,8 M R 131320 + 8 [dd]
> dd-910 [002] 38.762590: 94,8 A R 131328 + 8 <- (94,9) 131136
> dd-910 [002] 38.762590: 94,8 Q R 131328 + 8 [dd]
> dd-910 [002] 38.762591: 94,8 M R 131328 + 8 [dd]
> dd-910 [002] 38.762596: 94,8 A R 131336 + 8 <- (94,9) 131144
> dd-910 [002] 38.762597: 94,8 Q R 131336 + 8 [dd]
> dd-910 [002] 38.762598: 94,8 M R 131336 + 8 [dd]
> dd-910 [002] 38.762605: 94,8 A R 131344 + 16 <- (94,9) 131152
> dd-910 [002] 38.762607: 94,8 Q R 131344 + 16 [dd]
> dd-910 [002] 38.762608: 94,8 M R 131344 + 16 [dd]
> dd-910 [002] 38.762611: 94,8 A R 131368 + 32 <- (94,9) 131176
> dd-910 [002] 38.762612: 94,8 Q R 131368 + 32 [dd]
> dd-910 [002] 38.762616: 94,8 G R 131368 + 32 [dd]
> dd-910 [002] 38.762617: 94,8 I R 131368 + 32 [dd]
> dd-910 [002] 38.762619: 94,8 U N [dd] 2
> dd-910 [002] 38.762621: 94,8 D R 131264 + 96 [dd]
> dd-910 [002] 38.762625: 94,8 D R 131368 + 32 [dd]
> <idle>-0 [012] 38.763363: 94,8 C R 131264 + 96 [0]
> <idle>-0 [015] 38.763797: 94,8 C R 131368 + 32 [0]
>
> The usual behavior before both commits was:
> dd-919 [002] 37.513685: 94,8 A R 7824 + 96 <- (94,9) 7632
> dd-919 [002] 37.513693: 94,8 Q R 7824 + 96 [dd]
> dd-919 [002] 37.513697: 94,8 G R 7824 + 96 [dd]
> dd-919 [002] 37.513700: 94,8 P N [dd]
> dd-919 [002] 37.513701: 94,8 I R 7824 + 96 [dd]
> dd-919 [002] 37.513794: 94,8 A R 7928 + 32 <- (94,9) 7736
> dd-919 [002] 37.513795: 94,8 Q R 7928 + 32 [dd]
> dd-919 [002] 37.513800: 94,8 G R 7928 + 32 [dd]
> dd-919 [002] 37.513802: 94,8 I R 7928 + 32 [dd]
> dd-919 [002] 37.513804: 94,8 U N [dd] 2
> dd-919 [002] 37.513806: 94,8 D R 7824 + 96 [dd]
> dd-919 [002] 37.513810: 94,8 D R 7928 + 32 [dd]
> <idle>-0 [011] 37.514362: 94,8 C R 7824 + 96 [0]
> <idle>-0 [014] 37.514728: 94,8 C R 7928 + 32 [0]
>
> That remaining split is cause by the test for:
> "dio->final_block_in_bio != dio->cur_page_block".
> As this was in the code for a long time I just assume it is right.
>
> So eventually for the 64k request in the example this patch improves the
> effective submissions that get to the block device layer from:
> 10x4k, 1x8k, 1x16k to 1x48k & 1x16k which is much better.
>
> Througput impact is small, but in terms of cpu consumption this is visible
> by a single digit percentage depending on the incoming request size.
>
> The solution looking for comments or alternatives in this RFC patch adds a new
> kiocb flag that let filesystems specify if they need these workaround to
> separate meta data reads - if not, like all pre-btrfs filesystems the dio code
> doesn't have to cause this extra work.
>

So this brings up an important question, which is how badly do we want to honor
buffer_boundary()? The whole reason I added the logical offset check was because
we ignore buffer_boundary() if there is no bio currently. So for BTRFS we would
do this

map extent
set buffer boundary

but if this is the first part of the IO and there isn't a bio already setup,
dio_new_bio clears dio->boundary, so the next extent we would get may not be
logically contiguous and hilarity would ensue. I chose not to fix the
dio->boundary clearing because I was worried I would affect other fs's workloads
(which I did anyway because of my bug). So maybe the right idea is to rip out
my logical offset tests altogether and fix dio so we treat buffer_boundary()
like gospel. That way Btrfs can get what it needs without having this weird
special code, and then we can look at how other fs's set buffer_boundary (I'm
pretty sure ext2/3 are the only ones) and make sure they are setting it when
they really mean to. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/