Re: [patch 1/2] epoll fix own poll()

From: Pavel Pisa
Date: Fri Jan 30 2009 - 05:17:21 EST


On Friday 30 January 2009 08:41:53 Davide Libenzi wrote:
> On Fri, 30 Jan 2009, Pavel Pisa wrote:
> > epoll_wait() unconditionally calls ep_poll(), the "rdllist" is not empty
> > (cheked by previous ep_eventpoll_poll() call), so ep_send_events()
> > has to be called. It moves all events from "rdllist" onto "txlist"
> > and reports events confirmed by individual poll() calls to user.
> > There could be non signaling events in the "rdlist" left there from
> > previous scan for level triggering events or in case, that condition
> > is reset by some other intervention before ep_eventpoll_poll().
> > All this is not problem yet, it would result in one abundant call to
> > epoll_wait() which returns 0. No problem. The EPOLLET events
> > are moved from "rdlist", others (level triggered events) signalling
> > are queued on "rdlist" again. If userspace does its duty, they would
> > be removed during next call. This all seems to suggests, that one
> > abundant epoll_wait() would get things back on the right path.
> > All critical processing (moving between lists) seems to be under
> > held of spinlock, so possibility for bugs is low. I have not yet analyzed
> > overflow list which complicates things.
> >
> > I am not clever enough, to see, where is the problem, that situation
> > is not stabilized. Can you teach me Davide, please?
>
> Sorry, I missed part of the email. Yes, if revents is zero in the
> send-lood, I don't see how files can be added back to the ready-list.
> Please, create a very small test program that exhibit such behaviour ...

Hello Davide,

I have seen the reason exactly when I have sent the e-mail
and switched computer off at the night.

The reason is simple and it is caused by debugging print in each
loop in my application and by way how poll concept is implemented
in the kernel.

There is continuation of my analyze. The event is correctly cleaned
in ep_send_events but there is the culprit in my code, I have not
count before

+ add FD #0 into epoll set to be monitored for read event
do {
select/poll/epoll_wait on epfd
/* above line returns imediately because epoll->poll() indicates ready event*/
if there is event on epfd {
epoll_wait ( epfd, timeout = 0) /* to find which event is ready */
/* but no ready event is reported and cycle starts again */
}
+ debugging code prints something to FD #0
} while(1)

The the concept how file_operations_xxx.poll(struct file *file, poll_table *wait)
functions works, does not allow to distinguish for which conditions
callers waits. So there is only one list for IN/OUT/HUP etc events.
Even if the device/file/socked has three wait queues (one for these
waiting for IN, other for OUT and third third for IN|OUT),
there is no way to check callers intention so IN|OUT has to be
used unconditionally. Back to my code, when debug write is finished,
wait queue callback for FD #0 is called and this inserts epitem
back on "rdlist" and loop can continue for ever.

So the fault in this case is distributed between
- my application (it should not perform any activity -
debugging output - if there is no real noncascade event reported
and no time expired or event event activated by other thread)
- poll concept with common wait queue which results
in abundant wakeups and recheck need
- epoll implementation, which is not able to hide abundant wakes
to not be seen from userspace

I have not fully analyzed updated code version, but I expect,
that it moves full "rdlist" checking into ep_eventpoll_poll().
This hides abundant wakes at price of checking groving
list of really signaled events for each call. It has advantage,
that it could remove entries which really stop to signal events
for other reasons, but rechecking again and again has performance
penalty.

So the ideal solution is to change whole kernel poll mechanism
to allow separate IN and OUT queues for all important poll() function
providers. The other solution is to provide mask of intentionally wait
for events as additional parameter in file_operations_xxx.poll
which would to use optionally three queue concept (IN, OUT, IN|OUT)
for high bandwidth devices and significantly reduce number of
unnecessary wakeups.

As for epoll, you still need overflow list there, so the checking overhead
could be minimized, if all newly reported events are not put onto
"rdlist" directly, but onto some "potentionally_ready_list".
The ep_eventpoll_poll() should only call file.poll() for these epitems.
If the poll() reports ready, they should be moved to the final "rdlist".
If the "rdlist" is not empty, the IN event should be reported back.

Even in the actual poll scenario, the epoll overhead can be decreased
and ill cases can be eliminated.
The ep_poll() should call this processing of "potentionally_ready_list"
before check for "rdlist" before wait decision. In the current scenario,
same busy loop can be created even without cascading if there
is invoked unconditionally some operation on some of epoll set
monitored events which causes wakeup call but is then masked
by epitem events mask.

I have feeling, that I have seen problem report about some application
and epoll, which could be caused by this scenario somewhere.
But I cannot find it again.

So thanks for support and reply,

Pavel




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/