Re: [PATCH 08/17] bio: reimplement bio_copy_user_iov()

From: Tejun Heo
Date: Wed Apr 01 2009 - 19:58:21 EST


Hello, Boaz.

Boaz Harrosh wrote:
> I've looked hard and deep into your patchset, and I would like to
> suggest an improvement.
>
> [Option 1]
> What your code is actually using from the sgl code base is:
> for_each_sg
> sg_mapping_iter and its
> sg_miter_start, sg_miter_next
> ... (what else)
>
> I would like it if you could define the above for bvec(s), just the way
> you like them. Then the code works directly on the destination bvec inside
> the final bio: one less copy, no intermediate allocation, and no kmalloc of
> bigger-than-page buffers.
>
> These are all small inlines; duplicating them will not affect
> kernel size at all. You are not using the chaining ability of sgl(s),
> so it can be simplified. You will see that not having the intermediate
> copy simplifies the code even more.
>
> Since no outside user currently needs sgl(s), no functionality is lost.

Yeah, we can switch to dealing with bvecs instead of sgls. For kernel
mappings, it wouldn't make any difference. For user single mappings,
it'll remove an intermediate step. For user sg mappings, if we make
gup use sgl, it won't make much difference.
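
For reference, the miter pattern you're talking about boils down to
something like the sketch below (the real helpers live in
lib/scatterlist.c; the wrapper function here is made up for
illustration). Option 1 would provide the same kind of iterator
directly over a bio's bvecs.

#include <linux/scatterlist.h>

/* copy a kernel buffer into an sgl using the mapping iterator */
static void copy_buf_to_sgl(struct scatterlist *sgl, unsigned int nents,
			    const void *buf, size_t len)
{
	struct sg_mapping_iter miter;
	size_t off = 0;

	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC);
	while (off < len && sg_miter_next(&miter)) {
		/* miter.addr is a kmap of the current sg fragment */
		size_t bytes = min(miter.length, len - off);

		memcpy(miter.addr, buf + off, bytes);
		off += bytes;
	}
	sg_miter_stop(&miter);
}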

Frankly, for copy paths, I don't think all this matters at all. If
each request is small, other overheads will dominate. For large
requests, copying to and from sgl isn't something worth worrying
about.

For mapping paths, if we change gup to use sgl (we really can't make
it swallow bvecs), I don't think it will make a whole lot of difference
one way or the other. It might end up using slightly more cachelines,
but that will be about it.
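
If gup (or its caller) hands back a filled page array, packing it into
an sgl is about this much code. The helper and its parameters below are
hypothetical; only sg_init_table()/sg_set_page() are the real API.

/* pack a gup-filled page array into an sgl (hypothetical helper) */
static void pages_to_sgl(struct scatterlist *sgl, struct page **pages,
			 int nr_pages, unsigned int offset, unsigned int len)
{
	int i;

	sg_init_table(sgl, nr_pages);
	for (i = 0; i < nr_pages; i++) {
		unsigned int bytes = min_t(unsigned int, len,
					   PAGE_SIZE - offset);

		sg_set_page(&sgl[i], pages[i], bytes, offset);
		len -= bytes;
		offset = 0;	/* only the first page can start mid-page */
	}
}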

I do agree that using bvecs as the internal page carrier would be nicer
(sans having to implement stuff manually), but I'm not convinced it'll
be worth the added code.

> [Option 2]
> Keep a pointer to an sgl rather than a bvec in the bio; again, the code
> works on the final destination. Later users of the block layer that call
> blk_rq_fill_sgl (blk_rq_map_sg) will just get a copy of the pointer, so
> another allocation and copy is saved. This option will spill outside the
> scope of the current patches, into bvec-hacking code.
>
> I do like your long term vision of separating the DMA part from the
> virtual part of scatterlists. Note how they are actually two
> disjoint lists altogether. After the dma_map does its thing the dma
> physical list might be shorter than the virtual one, and sizes might not
> correspond at all. The dma mapping code regards the dma part as an
> empty list that gets appended to while processing; any segment match
> is accidental. (That is: inside the scatterlist the virtual address
> most probably does not match the dma address)
>
> So [option 1] matches more closely to that vision.
>
> Historically code was doing
> Many-sources => scatterlist => biovec => scatterlist => dma-scatterlist
>
> Only at 2.6.30 we can say that we shorten a step to do:
> Many-sources => biovec => scatterlist => dma-scatterlist
>
> Now you want to bring back the extra step, and I hate it.

It's not my favorite part either but I'm just not convinced it matters
all that much.

> [Option 2] can make that even shorter.
> Many-sources => scatterlist => dma-scatterlist

Yeah, this is the root cause. We don't have a common API to carry
segments, so we end up doing impedance matching when crossing
boundaries. Drivers expect an sgl at the interface. Filesystems
obviously expect bios with bvecs. gup expects an array of pages (this
one isn't too bad, though).
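
Concretely, the matching at the driver boundary goes roughly like the
fragment below: the request's bvecs get flattened into an sgl, which is
then mapped into the dma-scatterlist. q, rq, dev and the NR_SEGS bound
are assumed to come from the caller.

struct scatterlist sgl[NR_SEGS];	/* NR_SEGS: made-up segment bound */
int nents, mapped;

sg_init_table(sgl, NR_SEGS);
nents = blk_rq_map_sg(q, rq, sgl);			/* bvecs -> scatterlist */
mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);	/* -> dma-scatterlist */
/* 'mapped' may be shorter than 'nents' if the IOMMU coalesced segments */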

> Please consider [option 1]: it will only add some source code
> but will not increase code size (maybe it will even decrease it),
> and it will be fast.
>
> Please consider that this code path is used by me, in exofs and
> pNFS-objects, in a very, very hot path where memory pressure is a
> common scenario.

I'm not hugely against using bvec inside. I just didn't see much
difference and went for something easier, so yeah, it definitely is an
option.

Jens, Tomo, what do you guys think?

> And I have one more question. Are you sure kmallocs of
> bigger-than-page buffers are safe?

It has a higher chance of failure but is definitely safe.

> As I understood it, that tries to allocate physically contiguous
> pages, which degrades as time passes, and the last time I tried this with
> a kmem_cache (due to a bug), it crashed the kernel randomly after 2
> minutes of use.

Then that's a bug. Yeah, high-order allocations fail more easily as
time passes, but low-order (1, maybe 2) allocations aren't too bad.
If it matters, we can make bio_kmalloc() use vmalloc() for the bvecs
when the allocation goes over PAGE_SIZE. But again, given that nobody
has reported the spurious GFP_DMA in bounce_gfp, which triggers a
frenzied OOM-killing spree for large SG_IOs, I don't think this
matters all that much.
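
If we did go that way, something as simple as the sketch below should
do. The helper names are made up, and vmalloc() ignores the gfp flags
and can sleep, so it would be GFP_KERNEL-only.

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>		/* is_vmalloc_addr() */

static void *bvec_alloc_large(size_t size, gfp_t gfp)
{
	if (size <= PAGE_SIZE)
		return kmalloc(size, gfp);
	return vmalloc(size);	/* doesn't need physically contiguous pages */
}

static void bvec_free_large(void *p)
{
	if (is_vmalloc_addr(p))
		vfree(p);
	else
		kfree(p);
}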

Thanks.

--
tejun