Re: Linux's implementation of poll() not scalable?

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Oct 24 2000 - 13:33:57 EST

Next message: Martin Dalecki: "Re: PATCH: killing read_ahead[]"
Previous message: Guest section DW: "Re: IDE-Floppy and devfs"
In reply to: Dan Kegel: "Re: Linux's implementation of poll() not scalable?"
Next in thread: Dan Kegel: "Re: Linux's implementation of poll() not scalable?"
Reply: Dan Kegel: "Re: Linux's implementation of poll() not scalable?"
Reply: Edgar Toernig: "Re: Linux's implementation of poll() not scalable?"

On Tue, 24 Oct 2000, Dan Kegel wrote:
>
> But user code currently written for poll() has the luxury of dropping
> events because poll() will happily report on the current readiness of
> the socket every time. /dev/poll is level-triggered because it's trying
> to make conversion of poll()-based code easy. With your scheme,
> whatever user code is receiving the edges better darn well do something
> about them, because it's only going to get them once.

Oh, I agree. I'm not saying that my approach magically fixes bugs in user
space ;)
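
To make the quoted point concrete: with one-shot edges, a reader that stops early is never told again about the data it left behind, so the usual discipline is to drain a non-blocking descriptor until the kernel says there is nothing left. The following is a minimal sketch of that loop, not code from this thread; the helper names `set_nonblocking` and `drain_fd` are made up for illustration, and only standard POSIX calls (`fcntl`, `read`) are used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: consume everything currently readable on fd.
 * Returns the number of bytes read, or -1 on a real error.
 * Stopping only at EAGAIN (or EOF) is what makes it safe to wait for
 * the next edge: nothing is left behind that would need re-reporting. */
ssize_t drain_fd(int fd)
{
    char buf[1024];
    ssize_t total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;      /* a real program would process buf here */
            continue;
        }
        if (n == 0)
            return total;    /* EOF */
        if (errno == EINTR)
            continue;        /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;    /* drained: wait for the next edge */
        return -1;
    }
}
```

A poll()-style (level-triggered) consumer can get away with a single read() per wakeup, because the readiness is reported again next time; the loop above is the extra work the edge-triggered consumer takes on.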

> > The BSD kevent paper goes on about "level and edge triggered" and it
> > becomes a big thing for them, and they selected level-triggered events as
> > if it made any difference. And sure - it _does_ make a difference, but the
> > only difference is in how hard it is to implement, and level-triggered is
> > noticeably harder.
>
> I don't see why edge triggered is that much harder. All it adds is
                  ^^^^ level
> a layer which receives the edges and moves fds back and forth between
> a 'ready' list and a 'not ready' list. Easy as pie.

Not true.

For example, if you're truly level-triggered, and you have a socket that
gets data, the event moves onto the event queue. So far so fine: both edge
and level agree about this one.

The point where they disagree is when the event gets removed from the event
queue. For edge triggered, this one is trivial: when a get_events() thing
happens and moves it into user land. This is basically a one-liner, and it
is local to get_events() and needs absolutely no help from anybody else.
So obviously event removal is _very_ simple for edge-triggered events -
the INTACK basically removes the event (and also re-arms the trigger
logic: which is different from most interrupt controllers, so the analogy
falls down here).
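
As a rough illustration of why the edge-triggered removal is local to get_events(), here is a small user-space model. It is a sketch under assumptions, not kernel code: `struct event`, `struct event_queue`, `post_edge()` and this `get_events()` signature are all hypothetical, and a hand-rolled singly linked list stands in for the kernel's list primitives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one event source (e.g. a socket). */
struct event {
    int fd;                 /* descriptor the event refers to */
    int queued;             /* 1 while linked on the pending queue */
    struct event *next;
};

struct event_queue {
    struct event *head;
    struct event **tail;    /* address of the last 'next' pointer */
};

void queue_init(struct event_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* The "driver" side: something changed on this source (the edge). */
void post_edge(struct event_queue *q, struct event *ev)
{
    if (ev->queued)         /* already pending: edges coalesce */
        return;
    ev->queued = 1;
    ev->next = NULL;
    *q->tail = ev;
    q->tail = &ev->next;
}

/* Copy up to 'max' pending fds to the caller.
 * Removal is unconditional: no question is asked of the source about
 * whether it is still readable, and clearing 'queued' re-arms it. */
size_t get_events(struct event_queue *q, int *out, size_t max)
{
    size_t n = 0;

    while (n < max && q->head != NULL) {
        struct event *ev = q->head;

        assert(ev->queued);
        q->head = ev->next;          /* the "one-liner" removal */
        if (q->head == NULL)
            q->tail = &q->head;
        ev->queued = 0;              /* re-arm for the next edge */
        out[n++] = ev->fd;
    }
    return n;
}
```

Note that get_events() never looks at what the event means; it only unlinks entries and hands their descriptors to the caller.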

For level, the thing is not as easy at ALL. Suddenly removal becomes a big
issue, and needs help from the actual driver. You can do it two ways:
calling down to the driver when you remove (to see if the event should be
dismissed or not once it has been read) or have the driver pro-actively
remove the event whenever a read() happens (or whatever it is that undoes the
event).

Both are actually fairly hard. Much harder than they sound. For different
reasons.

- the callback approach at get_events() time sounds trivial, but actually
   has two problems: cache footprint for "get_events()" grows a _lot_
   (because the events are likely to be spread out a lot if there are a
   lot of them pending, so you don't get a nice tight inner loop at all),
   and you get "double events" - at the time the event is first reported, it
   will still be active, so we cannot actually remove it at that time
   (there is still data to be read - and the event doesn't go away until
   we read it) so we'll get the event _again_, and on the next
   get_events() it will notice that the event was bogus, and remove it
   (and we can optimize it away from reporting it to user land at that
   point, so only the kernel needs to look at it twice and do two
   callbacks)

- the "proactively remove events when the thing that triggered them goes
   away" approach means that each anti-event (like a read that empties the
   buffer) needs to undo its events, but it also needs to be careful that
   it doesn't undo combined events, and it needs to be very careful about
   races (new packet coming in), so the proactive remove actually ends up
   being less than trivial - and in a performance-critical section.
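
The first of the two approaches above (calling back into the driver at get_events() time) can be modelled in user space roughly as follows. This is only a sketch under assumptions: `struct source`, `source_ready()`, `post_data()`, `consume()` and `get_events_level()` are hypothetical names, readiness is reduced to "unread bytes > 0", and locking is ignored entirely, which is exactly where the real difficulty lies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a readable source with a level condition. */
struct source {
    int fd;
    size_t unread;          /* buffered bytes; ready means unread > 0 */
    int queued;             /* 1 while linked on the event queue */
    struct source *next;
};

struct level_queue {
    struct source *head;
};

/* Stand-in for the per-driver callback that reports the current level. */
int source_ready(const struct source *s)
{
    return s->unread > 0;
}

/* Data arrives: raise the level and queue the source if needed. */
void post_data(struct level_queue *q, struct source *s, size_t bytes)
{
    s->unread += bytes;
    if (!s->queued) {
        s->queued = 1;
        s->next = q->head;
        q->head = s;
    }
}

/* The "anti-event": user code reads data. It does NOT touch the queue,
 * so a stale entry can stay behind until the next get_events_level(). */
void consume(struct source *s, size_t bytes)
{
    assert(bytes <= s->unread);
    s->unread -= bytes;
}

/* Walk the queue and ask every source whether it is still ready.
 * Ready sources are reported and stay queued; stale ones are dropped
 * without being reported. '*callbacks' counts driver callbacks made. */
size_t get_events_level(struct level_queue *q, int *out, size_t max,
                        size_t *callbacks)
{
    size_t n = 0;
    struct source **link = &q->head;

    while (*link != NULL && n < max) {
        struct source *s = *link;

        (*callbacks)++;
        if (source_ready(s)) {
            out[n++] = s->fd;
            link = &s->next;     /* cannot remove yet: still active */
        } else {
            *link = s->next;     /* bogus by now: remove silently */
            s->queued = 0;
        }
    }
    return n;
}
```

With a single source that receives data once and is then fully read, the model makes two callbacks for one user-visible event: one when the event is reported, and one on the following call, which only discovers that the entry is stale. Every callback also touches a different source's state, which is the cache-footprint cost described above.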

Now, compare that to a one-liner that just does a "list_del(&event->list)"
as it copies over the event to user mode. Wouldn't you say that the
edge-triggered version is simpler?

> > The reason "edge-triggered" (ie only when an event changes) is preferable
> > is that it's MUCH simpler, and means that the "get_events()" system call
> > doesn't need to actually understand what the events mean at all.
>
> Not much understanding is required on the part of the edge-to-level filter.

Implement it, and see. I bet you'll find that it gets really interesting
when you have concurrent accesses to the same file descriptor etc.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


This archive was generated by hypermail 2b29 : Tue Oct 31 2000 - 21:00:14 EST