Re: Strange issues with epoll since 5.0

From: Davidlohr Bueso
Date: Wed Apr 24 2019 - 17:34:25 EST


On Wed, 24 Apr 2019, Eric Wong wrote:

Omar Kilani <omar.kilani@xxxxxxxxx> wrote:
Hi there,

I'm still trying to piece together a reproducible test that triggers
this, but I wanted to post in case someone goes "hmmm... change X
might have done this".

Maybe Davidlohr knows, since he's responsible for most of the
epoll changes in 5.0.

Not really, I have not been made aware of any issues until now.


Basically, something's broken (or at least, has changed enough to
cause problems in user space) in epoll since 5.0. It's still broken in
5.1-rc5.

It doesn't happen 100% of the time. It's sort of hard to pin down but
I've observed the following:

* nginx not accepting connections under load
* A java app which uses netty / NIO having strange writability
semantics on channels, which confuses netty / java enough to not
properly flush written data on the socket.

I went and tested these Linux kernels:

4.20.17
4.19.32
4.14.111

And the issue(s) do not show up there.

I'm still actively chasing this up, and will report back - I haven't
touched kernel code in 15 years so I'm a little rusty. :)

A bisection and/or workload that triggers the issue would be great to
see what's going on.

Thanks,
Davidlohr