Re: [PATCH 0/2] Add epoll round robin wakeup mode

From: Michael Kerrisk
Date: Mon Feb 09 2015 - 15:25:50 EST


[CC += linux-api@xxxxxxxxxxxxxxx]

Jason,

Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api@xxxxxxxxxxxxxxx, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html


Thanks,

Michael


On Mon, Feb 9, 2015 at 9:05 PM, Jason Baron <jbaron@xxxxxxxxxx> wrote:
> Hi,
>
> When we are sharing a wakeup source among multiple epoll fds, we end up with
> thundering herd wakeups, since there is currently no way to add to the
> wakeup source exclusively. This series introduces 2 new epoll flags:
> EPOLLEXCLUSIVE, for adding to a wakeup source exclusively, and EPOLLROUNDROBIN,
> which is to be used in conjunction with EPOLLEXCLUSIVE to evenly
> distribute the wakeups. I'm showing perf results from the simple pipe() use case
> below, but this patch was originally motivated by a desire to improve
> wakeup balance and cpu usage for a shared listen socket.
>
> Perf stat, 3.19.0-rc7+, 4 core, Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz:
>
> pipe test wake all:
>
> Performance counter stats for './wake':
>
>       10837.480396      task-clock (msec)         #    1.879 CPUs utilized
>            2047108      context-switches          #    0.189 M/sec
>             214491      cpu-migrations            #    0.020 M/sec
>                247      page-faults               #    0.023 K/sec
>        23655687888      cycles                    #    2.183 GHz
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>        11242141621      instructions              #     0.48 insns per cycle
>         2313479486      branches                  #  213.470 M/sec
>           13679036      branch-misses             #    0.59% of all branches
>
> 5.768295821 seconds time elapsed
>
> pipe test wake balanced:
>
> Performance counter stats for './wake -o':
>
>         291.250312      task-clock (msec)         #    0.094 CPUs utilized
>              40308      context-switches          #    0.138 M/sec
>               1448      cpu-migrations            #    0.005 M/sec
>                248      page-faults               #    0.852 K/sec
>          646407197      cycles                    #    2.219 GHz
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>          364256883      instructions              #     0.56 insns per cycle
>           65775397      branches                  #  225.838 M/sec
>             535637      branch-misses             #    0.81% of all branches
>
> 3.086694452 seconds time elapsed
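
To put the two runs side by side, here is a small sketch of the arithmetic. The helper names reduction_factor() and print_reductions() are made up for illustration; the inputs are the perf stat figures quoted above.

```c
#include <stdio.h>

/* Hypothetical helper: how many times larger the wake-all counter is
 * than the same counter in the balanced run. */
static double reduction_factor(double wake_all, double balanced)
{
	return wake_all / balanced;
}

/* Inputs are copied from the two perf stat outputs quoted above. */
static void print_reductions(void)
{
	printf("task-clock:       %.1fx\n",
	       reduction_factor(10837.480396, 291.250312));
	printf("context-switches: %.1fx\n",
	       reduction_factor(2047108, 40308));
	printf("cpu-migrations:   %.1fx\n",
	       reduction_factor(214491, 1448));
}
```

Calling print_reductions() from a main() gives roughly 37x less CPU time, 51x fewer context switches and 148x fewer CPU migrations for the balanced run.
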
>
> Rough epoll manpage text:
>
> EPOLLEXCLUSIVE
> Provides exclusive wakeups when attaching multiple epoll fds to a
> shared wakeup source. Must be specified on an EPOLL_CTL_ADD operation.
>
> EPOLLROUNDROBIN
> Provides balancing for exclusive wakeups when attaching multiple epoll
> fds to a shared wakeup source. Must be specified with EPOLLEXCLUSIVE
> during an EPOLL_CTL_ADD operation.
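
As a reading aid, the draft text above might translate into a call like the one sketched here. This is only a sketch under assumptions: attach_exclusive() is a hypothetical helper, and the flag values are the ones used by the test program in this mail (which calls the second flag EPOLLBALANCED), not values taken from a released header. A kernel without the series may ignore or reject these bits.

```c
#include <sys/epoll.h>

/* Assumed values, copied from the test program in this mail. */
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif
#ifndef EPOLLROUNDROBIN
#define EPOLLROUNDROBIN (1u << 27)	/* EPOLLBALANCED in the test program */
#endif

/*
 * Hypothetical helper: attach fd to epfd for exclusive wakeups.
 * Per the draft text, both flags are only valid on EPOLL_CTL_ADD, and
 * the round robin flag is only set together with EPOLLEXCLUSIVE.
 * Returns the epoll_ctl() result (0 on success, -1 with errno set).
 */
static int attach_exclusive(int epfd, int fd, int round_robin)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

	if (round_robin)
		ev.events |= EPOLLROUNDROBIN;
	ev.data.fd = fd;

	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

Each waiter would call this once on its own epoll fd, passing the shared wakeup source (the pipe read end in the test program) as fd.
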
>
>
> Thanks,
>
> -Jason
>
> #include <unistd.h>
> #include <sys/epoll.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <pthread.h>
>
> #define NUM_THREADS 100
> #define NUM_EVENTS 20000
> #define EPOLLEXCLUSIVE (1 << 28)
> #define EPOLLBALANCED (1 << 27)	/* called EPOLLROUNDROBIN in the text above */
>
> int optimize, exclusive;
> int p[2];
> pthread_t threads[NUM_THREADS];
> int event_count[NUM_THREADS];
>
> struct epoll_event evt = {
> 	.events = EPOLLIN
> };
>
> void die(const char *msg) {
> 	perror(msg);
> 	exit(-1);
> }
>
> void *run_func(void *ptr)
> {
> 	int ret;
> 	int epfd;
> 	char buf[4];
> 	int id = *(int *)ptr;
> 	/* per-thread copy, so threads do not race on the global evt */
> 	struct epoll_event ev = evt;
>
> 	if ((epfd = epoll_create(1)) < 0)
> 		die("create");
>
> 	if (optimize)
> 		ev.events |= EPOLLBALANCED | EPOLLEXCLUSIVE;
> 	else if (exclusive)
> 		ev.events |= EPOLLEXCLUSIVE;
> 	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);
> 	if (ret)
> 		perror("epoll_ctl add error");
>
> 	while (1) {
> 		/* ev has room for a single event, so maxevents is 1 */
> 		ret = epoll_wait(epfd, &ev, 1, -1);
> 		ret = read(p[0], buf, sizeof(int));
> 		if (ret == 4)
> 			event_count[id]++;
> 	}
> }
>
> int main(int argc, char *argv[])
> {
> 	int i, j;
> 	int id[NUM_THREADS];
> 	int total = 0;
> 	int nohit = 0;
>
> 	if (argc == 2) {
> 		if (strcmp(argv[1], "-o") == 0)
> 			optimize = 1;
> 		if (strcmp(argv[1], "-e") == 0)
> 			exclusive = 1;
> 	}
>
> 	if (pipe(p) < 0)
> 		die("pipe");
>
> 	for (i = 0; i < NUM_THREADS; i++) {
> 		id[i] = i;
> 		pthread_create(&threads[i], NULL, run_func, &id[i]);
> 	}
>
> 	for (j = 0; j < NUM_EVENTS; j++) {
> 		write(p[1], p, sizeof(int));
> 		usleep(100);
> 	}
>
> 	for (i = 0; i < NUM_THREADS; i++) {
> 		pthread_cancel(threads[i]);
> 		/* wait for the thread to exit before reading its counter */
> 		pthread_join(threads[i], NULL);
> 		printf("joined: %d\n", i);
> 		printf("event count: %d\n", event_count[i]);
> 		total += event_count[i];
> 		if (!event_count[i])
> 			nohit++;
> 	}
>
> 	printf("total events is: %d\n", total);
> 	printf("nohit is: %d\n", nohit);
> 	return 0;
> }
>
>
> Jason Baron (2):
> sched/wait: add round robin wakeup mode
> epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
>
> fs/eventpoll.c | 25 ++++++++++++++++++++-----
> include/linux/wait.h | 11 +++++++++++
> include/uapi/linux/eventpoll.h | 6 ++++++
> kernel/sched/wait.c | 5 ++++-
> 4 files changed, 41 insertions(+), 6 deletions(-)
>
> --
> 1.8.2.rc2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/