Re[2]: epoll'ing tcp sockets for reading

From: Nikolai ZHUBR
Date: Sun Dec 20 2009 - 18:22:04 EST

Hello Willy,
Sunday, December 20, 2009, 7:14:22 PM, Willy Tarreau wrote:
>> > The same thing can approximately be "emulated" by requesting FIONREAD for
>> > all EPOLLIN-ready sockets just after epoll returns, before any other work.
>> > It just would not look very elegant, IMHO.
>> There is no such thing as "atomic matter", since by the time you read the
>> event, more data might have come. It's just flawed, you see that?
Well, a careful application should choose not to read such newly arrived
data at this point, because that data actually belongs to the next
turn (see below). In other words, the read limit is known at the time
epoll returns, and this value need not change until the next epoll call,
no matter how much more data arrives in the meantime. (And that is why
FIONREAD is not perfectly suited to this: it always reports all data
available at the moment it is called.)

> I think that what Nikolai meant was the ability to wake up as soon as
> there are *at least* XXX bytes ready. But while I can understand why
> it would in theory save some code, in practice he would still have to

Uhhh, no. What I want is to ensure that incoming blocks of network data
(possibly belonging to different connections) are pulled in and processed
by the application in approximately the same order as they arrive from the
network. As long as no real queue exists for that, an application must
at least take care to _limit_ the amount of data it reads from any socket
per epoll call. (Otherwise, a very active connection with lots of
incoming data might cause other connections to starve badly.)
So the application needs to find a value for that limit.
The most reasonable value, IMHO, is simply the amount of data that
actually arrived on the socket between two successive epoll calls (the
latest one and the previous one). My point was that it would be handy
if epoll offered some way to get this value automatically (filled into
epoll_event, maybe?).
(Though FIONREAD can probably do the job reasonably well in most cases.)

Thank you!

Nikolai ZHUBR

> properly handle corner cases, which would defeat the original purpose
> of his modification :

> - if he waits for larger data than the socket buffer can handle, he
> will never wake up ;

> - if my memory serves me right, the copy_and_cksum() code only knows
> whether a segment is correct during its transfer to userland, which
> means that epoll() could very well wake up with XXX apparent bytes
> ready, but the read would fail before XXX due to an invalid checksum
> on an intermediate segment. So the code would still have to take
> care of that situation anyway.

> The last point implies the complete implementation of the code he wants
> to avoid anyway, and the first one implies it will be hard to know when
> this would work and when this would not. This means that while at first
> glance this behaviour could be useful, it would in practice be useless.

> Regards,
> Willy
