Re: [PATCH 1/2] aio: add vectored I/O support
From: Avi Kivity
Date: Sat Oct 16 2004 - 12:30:50 EST
Joel Becker wrote:
On Sat, Oct 16, 2004 at 10:43:04AM +0200, Avi Kivity wrote:
Using IO_CMD_READ for a vector entails
- converting the userspace structure (which might well an iovec) to iocbs
Why create an iov if you don't need to?
If you aren't writing directly to the kernel API, an iovec is very
convenient. It need not be an iovec, but surely you need _some_ vector.
- merging the iocbs
I don't see how this is different than merging iovs. Whether an
I/O range is represented by two segments of an iov or by two iocbs, the
elevator is going to merge them. If the userspace program had the
knowledge to merge them up front, it should have submitted one larger
segment.
No. An iovec is already merged; it is known that adjacent segments of an
iovec have adjacent offsets. a single IO_CMD_READV iovec can generate a
single bio without any merging.
The app did not submit a single large segment for the same reason
non-aio readv is used: because app memory is paged. in my case, a
userspace filesystem has a paged cache; large, disk-contiguous reads go
into many small noncontiguous memory pages. or it might be a database
performing a sequential scan and reading a large block into multiple
block buffers, which are usually discontiguous.
- coalescing the multiple completions in userspace to a single completion
You generally have to do this anyway. In fact, it is often far
more efficient and performant to have a pattern of:
submit 10;
reap 3; submit 3 more;
reap 6; submit 6 more;
repeat until you are done;
than to wait on all 10 before you can submit 10 again.
If the data is physically contiguous, it will (should) be merged, and
thus completed in a single event anyway. All 10 completions will happen
at the same time.
I might divide a 1M read into 4 iocbs to get the effect you mention. I
don't want to be forced into dividing them based on virtual address,
into 256 4K iocbs. *if* I wanted to do anything with partial data.
error handling is difficult as well. one would expect that a bad sector
with multiple iocbs would only fail one of the requests. it seems to be
non-trivial to implement this correctly.
I don't follow this. If you mean that you want all io from
later segments in an iov to fail if one segment has a bad sector, I
don't know that we can enforce it without running one segment at a
time. That's terribly slow.
That's not what I meant. If you submit 16 iocbs which are merged by the
kernel, and there is an error somewhere within the seventh iocb, I would
expect to get 15 success completions and one error completion. so error
information from the merged iocb must be demultiplexed into the originals.
If you have a single iocb, then any error simply fails that iocb.
Again, even if READV is a good idea, we need to fix whatever
inefficiencies io_submit() has. copying to/from userspace just can't be
that slow.
The inefficiencies I refered to were disk inefficiencies, not processor.
I think what happened was that the number of iocbs submitted (64 iocbs
of 4K each) did not merge because the device queue depth was very large;
no queuing occured because (I imagine) merging happens while a request
is waiting for disk readiness.
Decreasing the queue depth is not an option, because I might want to do
random reads of small iovecs later.
Of course, it is better to copy less than to copy more; so that is an
additional win for PREADV.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/