Re: epoll and multiple processes - eliminate unneeded process wake-ups

From: Eric Wong
Date: Mon Aug 03 2015 - 19:56:47 EST


Madars Vitolins <m@xxxxxxxxxxx> wrote:
> Hi Folks,
>
> I am developing a kind of open-systems application, which uses
> multiple processes/executables where each of them monitors some set
> of resources (in this case POSIX queues) via the epoll interface.
> For example, when 10 processes are in epoll_wait() on the same
> queue and one message arrives, all 10 processes get woken up and
> all of them try to read the message from the queue. One succeeds;
> the others get EAGAIN. The problem is with those others, which
> generate extra context switches - useless CPU usage. With more
> processes, the inefficiency gets worse.
>
> I tried to use EPOLLONESHOT, but it did not help. It seems this is
> suitable for a multi-threaded application and not for a
> multi-process application.

Correct. Most FDs are not shared across processes.
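
For reference, the rearm pattern looks roughly like this (untested
sketch; epfd/fd setup omitted).  EPOLLONESHOT only disables the fd
within the one epoll instance you rearm, so it cannot suppress
wakeups in other processes' epoll sets:

#include <sys/epoll.h>

/* rearm an EPOLLONESHOT fd after a worker thread has consumed its
 * event; only meaningful when all workers share this one epoll
 * instance (threads, not processes) */
static int rearm(int epfd, int fd)
{
        struct epoll_event ev = {
                .events = EPOLLIN | EPOLLONESHOT,
                .data.fd = fd,
        };

        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}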

> The ideal mechanism for this would be:
> 1. If multiple epoll sets in the kernel match the same event and
> one or more processes are in epoll_wait(), then send the event to
> only one waiter.
> 2. If none of the processes are waiting, then send the event to all
> epoll sets (as it is currently). Then the first free process will
> grab the event.

Jason Baron was working on this (search the LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE).

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced by your mqueue case than by
his case for listen sockets, though [*].

Typical applications have few (probably only one) listen sockets or
POSIX mqueues, so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.
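
For the listen socket case, an untested sketch of such a dedicated
acceptor thread (the handoff step is only a placeholder comment):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <errno.h>
#include <stdio.h>

/* blocking in accept4() puts this task on an exclusive wait queue,
 * so the kernel wakes one waiter per incoming connection instead of
 * every process sitting in epoll_wait() */
static void *acceptor(void *arg)
{
        int lfd = *(int *)arg;  /* listen socket, set up elsewhere */

        for (;;) {
                int cfd = accept4(lfd, NULL, NULL, SOCK_CLOEXEC);

                if (cfd < 0) {
                        if (errno == EINTR || errno == ECONNABORTED)
                                continue;
                        perror("accept4");
                        break;
                }

                /* hand cfd off to this process's own epoll set
                 * (or a worker pool) for nonblocking I/O */
        }

        return NULL;
}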

> How do you think, would it be real to implement this? How about
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small number
of queues in your system.
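
Something like the following (untested; the buffer size, timeout,
and processing step are placeholders, and the buffer must be at
least the queue's mq_msgsize):

#include <mqueue.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* one blocking receiver thread per queue; the timeout lets the
 * thread wake up periodically, e.g. to check a shutdown flag */
static void *mq_worker(void *arg)
{
        mqd_t mqd = *(mqd_t *)arg;  /* opened with mq_open() elsewhere */
        char buf[8192];             /* >= mq_msgsize of the queue */
        unsigned int prio;

        for (;;) {
                struct timespec ts;
                ssize_t n;

                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 5;     /* absolute deadline, 5s from now */

                n = mq_timedreceive(mqd, buf, sizeof(buf), &prio, &ts);
                if (n < 0) {
                        if (errno == ETIMEDOUT || errno == EINTR)
                                continue;
                        perror("mq_timedreceive");
                        break;
                }

                /* process the n-byte message in buf ... */
        }

        return NULL;
}

(Link with -lrt for the mqueue functions on glibc before 2.34.)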


[*] mq_timedreceive may copy a largish buffer, which benefits from
staying on the same CPU as much as possible.
In contrast, accept4 only creates a client socket. With a C10K+
socket server (e.g. http/memcached/DB), a typical new client
socket spends a fair amount of time idle, so I don't believe
memory locality inside the kernel is much of a concern when
there are thousands of accepted client sockets.