Re: Thread implementations...

Larry McVoy (lm@bitmover.com)
Thu, 25 Jun 1998 14:27:26 -0700


: > : caddr_t buf = mmap(0, len, PROT_READ, MAP_FILE | MAP_SHARED, ifd, 0);
: > : write(ofd, buf, len);
: >
: > 1) Your model above still does a copy.
:
: I have to admit, I can't see where.

OK, so read() is basically

find the from pages
bcopy(from_pages, to_user_virtual_address)

and write is

find the destination pages
bcopy(from_user_virtual_address, dest_pages)
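
From user space, the two bcopy() steps above show up as an ordinary
read()/write() loop. This is only a sketch for illustration: the function
name copy_via_read_write and the 8K buffer size are my assumptions, and the
"copy #1" / "copy #2" comments refer to the copies the kernel makes on the
caller's behalf.

```c
#include <assert.h>
#include <unistd.h>

/* Copy up to len bytes from ifd to ofd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error.
 * Hypothetical helper, not taken from the thread. */
ssize_t copy_via_read_write(int ifd, int ofd, size_t len)
{
    char buf[8192];
    size_t done = 0;

    assert(ifd >= 0 && ofd >= 0);
    while (done < len) {
        size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t n = read(ifd, buf, want);     /* kernel pages -> buf: copy #1 */
        if (n < 0)
            return -1;
        if (n == 0)
            break;                            /* EOF before len bytes */

        ssize_t off = 0;
        while (off < n) {                     /* handle short writes */
            ssize_t w = write(ofd, buf + off, (size_t)(n - off)); /* copy #2 */
            if (w < 0)
                return -1;
            off += w;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```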

Your example above is passing a mmap region to write(). Unless you go
teach write about page flipping, or unless you lock the pages and sleep
the process calling write, you have to bcopy out of the mmapped region
into the destination pages (or skb buffers).
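
For reference, here is the quoted two-liner filled out into something that
compiles. It is a sketch under my own assumptions: the name send_file_mmap,
the use of fstat() to get the length, and the short-write loop are not from
the thread, and MAP_FILE is left out because it is only a compatibility flag
on the systems that define it. The write() is still where the copy out of the
mapped region happens.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file at path to ofd using mmap() + write().
 * Returns the number of bytes written, or -1 on error.
 * Hypothetical helper, not taken from the thread. */
ssize_t send_file_mmap(const char *path, int ofd)
{
    struct stat st;
    size_t len, off = 0;
    char *buf;
    int ifd;

    assert(ofd >= 0);
    ifd = open(path, O_RDONLY);
    if (ifd < 0)
        return -1;
    if (fstat(ifd, &st) < 0) {
        close(ifd);
        return -1;
    }
    len = (size_t)st.st_size;
    if (len == 0) {                 /* mmap() rejects zero-length mappings */
        close(ifd);
        return 0;
    }

    /* Set up the virtual mapping the thread argues is unnecessary. */
    buf = mmap(NULL, len, PROT_READ, MAP_SHARED, ifd, 0);
    if (buf == MAP_FAILED) {
        close(ifd);
        return -1;
    }

    while (off < len) {
        /* The kernel copies from the mapped pages into its own buffers. */
        ssize_t w = write(ofd, buf + off, len - off);
        if (w < 0)
            break;
        off += (size_t)w;
    }

    munmap(buf, len);               /* tear the mapping down again */
    close(ifd);
    return off == len ? (ssize_t)len : -1;
}
```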

If you look at the splice() interfaces mentioned earlier, you can see that
they get things down to a DMA in from disk and a DMA out to the network (or
the other way around).

: Realistically, PC hardware can't do TCP checksumming, so
: the best you can do is 1/3 memory speed. DMA in, checksum,
: DMA out. Or can you copy-and-checksum directly to the
: buffers of the Ethernet card?

The splice() model was set up so that if you had cards that could do
scatter-gather and checksumming, you could send the data without ever having
the processor touch it.

For this sort of performance, job #1 is "processors must not touch the data".
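
The splice() interface discussed in this thread is a proposal, so it cannot
be shown running here. As a rough analogue only, Linux later gained a
sendfile(2) call that hands the kernel two descriptors and never exposes the
data to user space. The sketch below uses that call; it is Linux-specific,
the name send_file_kernel is my own, and whether the processor really stays
off the data depends on the kernel, the driver, and the hardware.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file at path to ofd without a user-space buffer.
 * Returns the number of bytes sent, or -1 on error.
 * Hypothetical helper; uses Linux sendfile(2), NOT the splice()
 * proposal from this thread. */
ssize_t send_file_kernel(const char *path, int ofd)
{
    struct stat st;
    off_t off = 0;
    int ifd;

    assert(ofd >= 0);
    ifd = open(path, O_RDONLY);
    if (ifd < 0)
        return -1;
    if (fstat(ifd, &st) < 0) {
        close(ifd);
        return -1;
    }

    while (off < st.st_size) {
        /* sendfile() advances off by the number of bytes it moved. */
        ssize_t n = sendfile(ofd, ifd, &off, (size_t)(st.st_size - off));
        if (n <= 0)
            break;
    }

    close(ifd);
    return off == st.st_size ? (ssize_t)off : -1;
}
```

Note what is missing compared with the mmap() version: no mapping is set up
or torn down, and the caller never holds a virtual address for the data.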

: > 2) On SGI's, for server type of operations, the mmap() is the bottleneck.
: > You are setting up and tearing down a virtual mapping that you don't
: > need: the ``currency'' you are dealing in at both ends is physical
: > pages, not virtual pages. This starts to become a bottleneck for
: > files smaller than 8K (Linux) or 32K (most other operating systems).
: > Linux is better because it is lighter.
:
: I must admit I don't understand why this is so. Surely the mmap
: just sets up some kernel structures, it doesn't actually create
: any virtual-physical mapping. Doesn't that happen when we fault
: and the memory is read in. So isn't the overhead per page, and not
: per mapping? (obviously not, but why?).

The mapping cost is not free; alpha8 of lmbench2 will try to
quantify this in the next few days.
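
In the meantime, a crude way to get a feel for the cost is to time the
map/touch/unmap cycle directly. This is not lmbench code, just a minimal
sketch: the function name mmap_cycle_usec and its parameters are my own, and
the figure it returns includes the page fault on the first byte as well as
the mmap() and munmap() calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Average cost, in microseconds, of mmap() + touch one byte + munmap()
 * on the first len bytes of the file at path. Returns -1.0 on error.
 * Hypothetical helper, not part of lmbench. */
double mmap_cycle_usec(const char *path, size_t len, int iters)
{
    struct timespec t0, t1;
    volatile char sink = 0;     /* keeps the compiler from dropping the touch */
    double ns;
    int fd, i;

    assert(len > 0 && iters > 0);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1.0;
        }
        sink = p[0];            /* fault the first page in */
        munmap(p, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    (void)sink;

    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / 1e3 / iters;
}
```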

It's needless, non-zero work. Let's put some numbers in here.
A decent uniprocessor can approach 2000 HTTP GETs/second. That's
500 usecs/GET. In that time, you have to do this:

establish a TCP connection (SYN, SYN-ACK, ACK);
that's two receives and a send.
wake up httpd and get the data (context switch, 2-3 syscalls at least)
find the file and send it (open, mmap, write, close)
close the socket

Not to mention all of the logging and security checks needed. Do an strace
of httpd some time.
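
The budget arithmetic is simple enough to write down. The two helpers below
are illustrative only, with names of my own choosing; the first reproduces
the 500 usecs/GET figure from the 2000 GETs/second rate above, and the second
shows what share of that budget any one step would take if you plug in a
measured cost.

```c
#include <assert.h>

/* Time available per request, in microseconds, at a given request rate. */
double usec_per_request(double requests_per_sec)
{
    assert(requests_per_sec > 0.0);
    return 1e6 / requests_per_sec;
}

/* Fraction of the per-request budget consumed by one step that costs
 * step_usec microseconds. step_usec is whatever you measured; no
 * particular value is implied here. */
double fraction_of_budget(double step_usec, double requests_per_sec)
{
    return step_usec / usec_per_request(requests_per_sec);
}
```

usec_per_request(2000.0) gives 500.0, matching the figure above.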

Anyway, in that sort of world, every single microsecond needs to be
accounted for and justified. Given that httpd isn't looking at the
static data, why the heck is the kernel translating from physical pages
to virtual pages and back again, when all it wanted was physical pages
on both ends of the transaction?

That question is the whole motivation for splice().
