Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
From: David Howells
Date: Mon Jun 23 2025 - 10:20:48 EST
Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor? Particularly if, say, it's just a small part of a
> > larger span.
>
> What is a "span" in this context?
In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len} defining a "span" of memory. Basically just a contiguous range
of physical addresses, if you prefer.
However, someone can, for example, vmsplice a span of memory into a pipe - say
they add a whole page, all nicely aligned, but then they splice it out a byte
at a time into 4096 other pipes. Each of those other pipes now has a small
part of a larger span and needs to share the cleanup information.
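
For anyone who wants to reproduce the userspace side of that scenario, here is a minimal sketch using the real vmsplice(2) and splice(2) calls on Linux. The function name fan_out_one_page and the constant NPIPES are made up for illustration, and 8 destination pipes stand in for the 4096 mentioned above so as to stay under typical file descriptor limits. The program can only check that the right bytes arrive; how the kernel shares the underlying page between the pipes is not observable from here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define NPIPES 8   /* illustrative stand-in for the 4096 pipes in the text */

/*
 * vmsplice one aligned page into a source pipe, then splice it out one byte
 * at a time into NPIPES other pipes.  Returns the number of destination
 * pipes whose byte matched the source page, or -1 on error.
 */
int fan_out_one_page(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    int src[2], dst[NPIPES][2];
    int opened = 0, matched = -1;
    char *page = NULL;
    struct iovec iov;

    if (pagesz <= 0 || pipe(src) < 0)
        return -1;
    if (posix_memalign((void **)&page, (size_t)pagesz, (size_t)pagesz) != 0) {
        page = NULL;
        goto out;
    }
    for (long i = 0; i < pagesz; i++)
        page[i] = (char)('A' + i % 26);

    /* Add the whole page, nicely aligned, to the source pipe. */
    iov.iov_base = page;
    iov.iov_len = (size_t)pagesz;
    if (vmsplice(src[1], &iov, 1, 0) != (ssize_t)pagesz)
        goto out;

    for (opened = 0; opened < NPIPES; opened++)
        if (pipe(dst[opened]) < 0)
            goto out;

    /* Splice it out a byte at a time, each byte into a different pipe. */
    matched = 0;
    for (int i = 0; i < NPIPES; i++) {
        char c;

        if (splice(src[0], NULL, dst[i][1], NULL, 1, 0) != 1 ||
            read(dst[i][0], &c, 1) != 1) {
            matched = -1;
            goto out;
        }
        if (c == page[i])
            matched++;
    }
out:
    /* Close every pipe before freeing the memory that was spliced in. */
    for (int i = 0; i < opened; i++) {
        close(dst[i][0]);
        close(dst[i][1]);
    }
    close(src[0]);
    close(src[1]);
    free(page);
    return matched;
}
```

Calling fan_out_one_page() from a main() should return NPIPES on a kernel where vmsplice and pipe-to-pipe splice are available.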
Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number of
kernel buffers whose refcounts the network layer isn't permitted to look at;
instead, a destructor must be called to release them. The request message may
transit through the loopback driver and get placed on the Rx queue of another
TCP socket, from where it may be spliced off into a pipe.
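
To make the "share the cleanup information" and "call a destructor" requirements concrete, here is a small userspace model. It is only a sketch under stated assumptions: struct span_cleanup, struct span_frag and the span_* helpers are hypothetical names, not kernel APIs, and the plain counter would need to be an atomic refcount in real code. The point it illustrates is that every fragment carved from a span points at one shared record, and the owner's destructor runs once, when the last fragment is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cleanup record shared by every fragment of one span. */
struct span_cleanup {
    unsigned int users;              /* fragments still outstanding */
    void (*destructor)(void *priv);  /* owner's callback */
    void *priv;
};

/* A fragment: a {start,len} window plus a pointer to the shared record. */
struct span_frag {
    size_t start;                    /* stand-in for a physical address */
    size_t len;
    struct span_cleanup *cleanup;
};

struct span_frag span_create(size_t start, size_t len,
                             void (*destructor)(void *), void *priv)
{
    struct span_cleanup *c = malloc(sizeof(*c));

    if (!c)
        abort();
    c->users = 1;
    c->destructor = destructor;
    c->priv = priv;
    return (struct span_frag){ .start = start, .len = len, .cleanup = c };
}

/* Carve [off, off+len) out of an existing fragment; shares the cleanup. */
struct span_frag span_split(const struct span_frag *f, size_t off, size_t len)
{
    assert(off <= f->len && len <= f->len - off);
    f->cleanup->users++;
    return (struct span_frag){ .start = f->start + off, .len = len,
                               .cleanup = f->cleanup };
}

/* Drop one fragment; the destructor runs only when the last one goes. */
void span_put(struct span_frag *f)
{
    struct span_cleanup *c = f->cleanup;

    f->cleanup = NULL;
    if (--c->users == 0) {
        c->destructor(c->priv);
        free(c);
    }
}

static void count_calls(void *priv)
{
    ++*(int *)priv;
}

/* Split one 4096-byte span into 4096 one-byte fragments, then drop them. */
int demo_shared_cleanup(void)
{
    enum { SPAN_LEN = 4096 };
    static struct span_frag parts[SPAN_LEN];
    int calls = 0;
    struct span_frag whole = span_create(0x1000, SPAN_LEN, count_calls, &calls);

    for (size_t i = 0; i < SPAN_LEN; i++)
        parts[i] = span_split(&whole, i, 1);
    span_put(&whole);                /* the original holder goes away first */
    for (size_t i = 0; i + 1 < SPAN_LEN; i++)
        span_put(&parts[i]);
    assert(calls == 0);              /* one fragment is still outstanding */
    span_put(&parts[SPAN_LEN - 1]);
    return calls;                    /* number of destructor invocations */
}
```

In this model the destructor is not called while any one-byte fragment survives, regardless of the order in which the holders drop theirs.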
Alternatively, if virtual I/O is involved, this message may get passed down to
a layer outside of the system (though I don't think this is, in principle, any
different from DMA being done by a NIC).
And then there's relayfs and fuse, which seem to do weird stuff.
For the splicing of a loop-backed kernel message out of a TCP socket, it might
make sense just to copy the message at that point. The problem is that the
kernel doesn't know what's going to happen next to it.
> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery. But that is configurable through the
> pipe_buf_operations. So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it. And you'd better have a really good use case for this to be worthwhile.
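
As a rough illustration of what "special pipe_buf_operations" would mean, the following userspace mock shows generic code duplicating and consuming buffers purely through an ops table, without knowing whether the backing memory uses a plain refcount or an owner-managed destructor. It is loosely modelled on the get and release hooks only; the mock_* types and functions are invented for this sketch, and the signatures are simplified and are not the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_buf;

/* Simplified stand-in for an ops table: only get and release. */
struct mock_buf_ops {
    bool (*get)(struct mock_buf *buf);
    void (*release)(struct mock_buf *buf);
};

struct mock_buf {
    const struct mock_buf_ops *ops;
    void *backing;                   /* whatever the ops act on */
    size_t offset, len;
};

/* Default flavour: the backing object carries a plain refcount. */
struct mock_page { int refcount; };

static bool page_get(struct mock_buf *buf)
{
    ((struct mock_page *)buf->backing)->refcount++;
    return true;
}

static void page_release(struct mock_buf *buf)
{
    ((struct mock_page *)buf->backing)->refcount--;
}

const struct mock_buf_ops page_ref_ops = { page_get, page_release };

/* Special flavour: owner-managed memory; the last release notifies the owner. */
struct mock_owned { int users; int destructor_calls; };

static bool owned_get(struct mock_buf *buf)
{
    ((struct mock_owned *)buf->backing)->users++;
    return true;
}

static void owned_release(struct mock_buf *buf)
{
    struct mock_owned *o = buf->backing;

    if (--o->users == 0)
        o->destructor_calls++;       /* stands in for calling the destructor */
}

const struct mock_buf_ops owned_ops = { owned_get, owned_release };

/* Generic step: duplicate a sub-range without knowing the flavour. */
bool mock_buf_dup(const struct mock_buf *src, size_t off, size_t len,
                  struct mock_buf *dst)
{
    if (off > src->len || len > src->len - off)
        return false;
    *dst = *src;
    dst->offset = src->offset + off;
    dst->len = len;
    return dst->ops->get(dst);
}

void mock_buf_consume(struct mock_buf *buf)
{
    buf->ops->release(buf);
}

/* Returns 0 if both flavours behave as expected, -1 otherwise. */
int mock_ops_demo(void)
{
    struct mock_page page = { .refcount = 1 };
    struct mock_owned owned = { .users = 1, .destructor_calls = 0 };
    struct mock_buf a = { &page_ref_ops, &page, 0, 4096 }, a1;
    struct mock_buf b = { &owned_ops, &owned, 0, 4096 }, b1;

    if (!mock_buf_dup(&a, 10, 1, &a1) || !mock_buf_dup(&b, 10, 1, &b1))
        return -1;
    mock_buf_consume(&a);
    mock_buf_consume(&b);
    if (page.refcount != 1 || owned.destructor_calls != 0)
        return -1;                   /* the 1-byte pieces keep both alive */
    mock_buf_consume(&a1);
    mock_buf_consume(&b1);
    return (page.refcount == 0 && owned.destructor_calls == 1) ? 0 : -1;
}
```

The cost hinted at in the quoted text shows up even in the mock: every producer of such buffers has to supply and maintain its own ops table.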
Yes. vmsplice is the equivalent of direct I/O and should really do the same
pinning that, say, a write() to an O_DIRECT file does.
David