Re: [RFC]: Support for zero-copy TCP transmit of user space data

From: Vladislav Bolkhovitin
Date: Tue Dec 23 2008 - 14:11:34 EST


Jens Axboe, on 12/19/2008 10:27 PM wrote:
An iSCSI target driver iSCSI-SCST was a part of the patchset (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to have TCP zero-copy transmit of user space data was implemented. Patch, implementing this optimization was also sent in the patchset, see http://lkml.org/lkml/2008/12/10/296.
I'm probably ignorant of about 90% of the context here, but isn't this the sort of problem that was supposed to have been solved by vmsplice(2)?
No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, even if it was a user space driver, vmsplice wouldn't change anything much. It doesn't have a possibility for a user to know, when transmission of the data finished. So, it is intended to be used as: vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> vmsplice() it. But on the mmap() stage kernel has to zero all the newly mapped pages and zeroing memory isn't much faster, than copying it. Hence, there would be no considerable performance increase.
vmsplice() isn't the right choice, but splice() very well could be. You
could easily use splice internally as well. The vmsplice() part sort-of
applies in the sense that you want to fill pages into a pipe, which is
essentially what vmsplice() does. You'd need some helper to do that.
Sorry, Jens, but splice() works only if there is a file handle on the another side, so user space doesn't see data buffers. But SCST needs to serve a wider usage cases, like reading data with decompression from a virtual tape, where decompression is done in user space. For those only complete zero-copy network send, which I implemented, can give the best performance.

__splice_from_pipe() takes a pipe, a descriptor and an actor. There's
absolutely ZERO reason you could not reuse most of that for this
implementation. The big bonus here is that getting the put correct from
networking would even make splice() better for everyone. Win for Linux,
win for you since it'll make it MUCH easier for you to get this stuff
in. Looking at your original patch and I almost think it's a flame bait
to induce discussion (nothing wrong with that, that approach works quite
well and has been used before). There's no way in HELL that it'd ever be
a merge candidate. And I suspect you know that, at least I hope you do
or you are farther away from going forward with this than you think.

So don't look at splice() the system call, look at the infrastructure
and check if that could be useful for your case. To me it looks
absolutely like it could, if you goal is just zero-copy transmit.

I looked at the splice code again to make sure I don't miss anything. __splice_from_pipe() leads to pipe_to_sendpage(), which leads to sock_sendpage, then to sock->sendpage(). Sorry, but I don't see any point why to go over all the complicated splice infrastructure instead of directly call sock->sendpage(), as I do.

The
only missing piece is dropping the reference and signalling page
consumption at the right point, which is when the data is safe to be
reused. That very bit is missing, but that should be all as far as I can
tell.

This is exactly what I implemented in the patch we are discussing.

And
the ack-on-xmit-done bits is something that splice-to-socket needs
anyway, so I think it'd be quite a suitable choice for this.
So, are you writing that splice() could also benefit from the zero-copy transmit feature, like I implemented?

I like how you want to reinvent everything, perhaps you should spend a
little more time looking into various other approaches? splice() already
does zero-copy network transmit, there are no copies going on. Ideally,
you'd have zero copies moving data into your pipe, but migrade/move
isn't quite there yet. But that doesn't apply to your case at all.

What is missing, as I wrote, is the 'release on ack' and not on pipe
buffer release. This is similar to the get_page/put_page stuff you did
in your patch, but don't go claiming that zero-copy transmit is a
Vladislav original - the ->sendpage() does no copies.

Jens, I have never claimed I reinvented ->sendpage(). Quite opposite, I use it. I only extended it by a missing feature. Although, seems, since you were misleaded, I should apologize for not too good description of the patch.

Thanks,
Vlad



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/