ioevent queues (was Re: Proposed new poll2() syscall)

Dean Gaudet (dgaudet-list-linux-kernel@arctic.org)
Sat, 23 Aug 1997 09:34:28 -0700 (PDT)


Warning, this is long, but I think worth it. If you've heard of NT's
completion ports that's where I'm heading.

Here's something that's more ambitious than a new poll(), but is also
really important for scalable I/O. The problem with both poll() and
select() is that they both require the kernel and the user to deal with a
list of all possible events, rather than a list of exactly the events that
have occured. In a multithreaded/multiprocess application you get into
even worse situations where multiple workers get the same (or similar)
list of events and then fight with each other as to who will do what.

For example, consider Apache's child_main loop which has to accept()
connections from multiple sockets. Here's the naive implementation:

for (;;) {
for (;;) {
copy listen_fd_list;
select (n, &listen_fd_list, NULL, NULL, NULL);

fd = -1;
for (i = 0; i < n; ++i) {
if (FD_ISSET (i, &listen_fd_list)) {
if ((fd = accept (i, &sa_client, &clen)) != -1) {
break;
}
}
}
if (fd != -1) break;
}
process the request on fd;
}

What's wrong with this? Well remember that there is more than one process
doing exactly the same loop. If you look carefully there's a huge
starvation problem -- the listening sockets are blocking. All processes
will block on accept() except the single lucky one that actually got a
socket.

The next naive change might be to make the listening sockets non-blocking.
But now look, you've got a really busy loop there. One lucky process gets
the new socket, the rest spin gleefully and end up back in select blocked.
Only one process actually accomplished anything, the other 10, 20
processes did nothing.

The actual fix used in Apache is to use a locking primative (in 1.2 the
options are fcntl or flock, in 1.3 there's many more options for other
unixes). The lock is acquired before entering the select/accept loop, and
released on exit. So only one process is actually busy trying to get a
new socket at any time. It's about the most efficient alternative we've
found for unix systems.

You could argue that the real problem is that Apache is a multiprocess
model, and that a multithreaded model where one thread does the
select/accept loop and then launches another thread to handle the request
is a better solution. I'll disagree with you for many reasons, but the
one relevant to this discussion is that thousands of threads don't scale.
(Neither do thousands of children.) There just aren't enough CPUs in the
average box to make it worthwhile to have thousands of units of execution
simultaneously.

Now suppose you had a magic pipe that spit out structures looking like
this:

struct ioevent {
void *user_supplied_pointer;
};

The ioevent is an indication that some asynchronous i/o event is now
ready. It could be a socket is ready for accept, or a descriptor is ready
for writing (or even more cool things that are difficult to multiplex in
unix right now, such as a child has exited). The user_supplied_pointer is
just that, a pointer that you told the kernel to associate with a
particular descriptor (through fcntl() probably).

Now you can have a bunch of worker threads essentially perform this loop:

for (;;) {
struct ioevent ev;

read (magic_pipe, &ev, sizeof (ev));

process the io event;

perform another asynchronous io;
}

No spinning, no complex locking, no on-the-fly thread creation. No
needless scanning over descriptors which have no event associated with
them.

The next step on this is to create a user-level threads package which is
co-operatively multitasked, and runs inside a set of worker kernel level
threads. All i/o is done asynchronously, and the above loop becomes the
main decision loop as to which user-level thread to run. This makes it
easy to write programs thinking they have thousands of threads to work
with, but really they only have enough concurrency to saturate the
machine. The result is a very scalable application.

NT has all of this, called completion ports and fibers. If you do a
search on microsoft.com you'll be able to find a paper by a fellow named
Vert in which he describes this stuff. I'm not proposing that we copy
their API at all, in fact I haven't even looked at it so I might be off
base here.

Here's my suggested API:

struct ioevent {
void *ioe_user; /* user_supplied_pointer was for pedagogical purposes*/
int ioe_result; /* system call result code */
int ioe_errno; /* system call errno */
};

ioeventpipe (int nevents)
- returns a descriptor to an ioevent_pipe capable of handling nevents
outstanding, possibly uncompleted events.

fcntl (fd, F_SETIOEVENTPIPE, ioeventpipe_fd, user)
- all i/o on fd will be done asynchronously and results will be
queued on ioeventpipe

fcntl (fd, F_GETIOEVENTPIPE, &ioeventpipe_fd, &user)
- return the associated setting

New errno:
EIOEVENTFULL possibly returned by any i/o operation on a
descriptor which is attached to an ioeventpipe. This is
returned when the ioeventpipe is full. In this case the
i/o has not been scheduled.

EIOEVENTQUEUED returned by any i/o operation on a descriptor which
was successfully queued on its ioeventpipe.

The system call happens asynchronously, and all buffers, or other
structures given to it are considered "in use" because the kernel
will complete the call and stuff the result code and errno into
the struct ioevent. This is to eliminate the need to make the
io call twice -- i.e. once to cause it to be queued, and then once
to get the result.

int fork2 (int ioeventpipe_fd, void *user)
- fork(), but when the child exits an event will be scheduled on
the fd. (This makes it far easier to write programs which
have to both watch for dying children and watch for i/o
events.)

I think that covers it ... I'm sure the kernel hackers will tell me how
unreasonable the API is ;)

Dean