Re: Glibc recvmsg from kernel netlink socket hangs forever

From: Guenter Roeck
Date: Fri Sep 25 2015 - 00:37:07 EST


On Thu, Sep 24, 2015 at 05:29:34PM -0700, Steven Schlansker wrote:
> Hello linux-kernel,
>
> I write to you on behalf of many developers at my company, whose
> applications lock up endlessly inside libc code, with no hope of
> recovery.
>
> Currently it affects our Mono and Node processes mostly, and the
> symptoms are the same: user code invokes getaddrinfo, and libc
> attempts to determine whether ipv4 or ipv6 is appropriate, by using
> the RTM_GETADDR netlink message. The write into the netlink socket
> succeeds, and it immediately reads back the results ... and waits
> forever. The read never returns. The stack looks like this:
>
> #0 0x00007fd7d8d214ad in recvmsg () at ../sysdeps/unix/syscall-template.S:81
> #1 0x00007fd7d8d3e44d in make_request (fd=fd@entry=13, pid=1) at ../sysdeps/unix/sysv/linux/check_pf.c:177
> #2 0x00007fd7d8d3e9a4 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fd7d37fdd00, seen_ipv6=seen_ipv6@entry=0x7fd7d37fdd10,
> in6ai=in6ai@entry=0x7fd7d37fdd40, in6ailen=in6ailen@entry=0x7fd7d37fdd50) at ../sysdeps/unix/sysv/linux/check_pf.c:341
> #3 0x00007fd7d8cf64e1 in __GI_getaddrinfo (name=0x31216e0 "mesos-slave4-prod-uswest2.otsql.opentable.com", service=0x0,
> hints=0x31216b0, pai=0x31f09e8) at ../sysdeps/posix/getaddrinfo.c:2355
> #4 0x0000000000e101c8 in uv__getaddrinfo_work (w=0x31f09a0) at ../deps/uv/src/unix/getaddrinfo.c:102
> #5 0x0000000000e09179 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:91
> #6 0x0000000000e16eb1 in uv__thread_start (arg=<optimized out>) at ../deps/uv/src/unix/thread.c:49
> #7 0x00007fd7d8ff3182 in start_thread (arg=0x7fd7d37fe700) at pthread_create.c:312
> #8 0x00007fd7d8d2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> (libuv is part of Node and makes DNS lookups "asynchronous" by running
> them on a background thread pool)
>
> The applications will run for hours or days successfully, until eventually hanging with
> no apparent pattern or cause. And once this hang happens it hangs badly: check_pf
> holds a lock across the problematic recvmsg call, so every subsequent getaddrinfo
> in the process blocks behind the stuck thread.
>
> I raised this issue on the libc-help mailing list, but I'm hoping lkml has
> more people familiar with netlink who can offer advice.
> The original thread is here:
> https://sourceware.org/ml/libc-help/2015-09/msg00014.html
>
> Looking at the getaddrinfo / check_pf source code:
> https://fossies.org/dox/glibc-2.22/sysdeps_2unix_2sysv_2linux_2check__pf_8c_source.html
>
> 146   if (TEMP_FAILURE_RETRY (__sendto (fd, (void *) &req, sizeof (req), 0,
> 147                                     (struct sockaddr *) &nladdr,
> 148                                     sizeof (nladdr))) < 0)
> 149     goto out_fail;
> 150
> 151   bool done = false;
> 152
> 153   bool seen_ipv4 = false;
> 154   bool seen_ipv6 = false;
> 155
> 156   do
> 157     {
> 158       struct msghdr msg =
> 159         {
> 160           (void *) &nladdr, sizeof (nladdr),
> 161           &iov, 1,
> 162           NULL, 0,
> 163           0
> 164         };
> 165
> 166       ssize_t read_len = TEMP_FAILURE_RETRY (__recvmsg (fd, &msg, 0));
> 167       if (read_len <= 0)
> 168         goto out_fail;
> 169
> 170       if (msg.msg_flags & MSG_TRUNC)
> 171         goto out_fail;
> 172
>
> I notice there is a possibility that if messages are dropped on either the send
> or receive side, this code could hang forever. The netlink(7) man page makes
> me slightly worried:
>
> > Netlink is not a reliable protocol. It tries its best to deliver a message to its destination(s), but may drop messages when an out-of-memory condition or other error occurs.
> > However, reliable transmissions from kernel to user are impossible in any case. The kernel can't send a netlink message if the socket buffer is full: the message will be dropped and the kernel and the user-space process will no longer have the same view of kernel state. It is up to the application to detect when this happens (via the ENOBUFS error returned by recvmsg(2)) and resynchronize.
>
>
> I have taken the glibc code and created a simple(r) program to attempt to reproduce this issue.
> I inserted some simple polling between the sendto and recvmsg calls to make the failure case more evident:
>
>     struct pollfd pfd;
>     pfd.fd = fd;
>     pfd.events = POLLIN;
>     pfd.revents = 0;
>
>     int pollresult = poll(&pfd, 1, 1000);
>     if (pollresult < 0) {
>         perror("glibc: check_pf: poll");
>         abort();
>     } else if (pollresult == 0 || (pfd.revents & POLLIN) == 0) {
>         fprintf(stderr, "[%ld] glibc: check_pf: netlink socket read timeout\n", gettid());
>         abort();
>     }
>
> I have placed the full source code and strace output here:
> https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473
>
> The process quickly spawns hundreds of threads that sit in a
> loop repeating this RTM_GETADDR message exchange.
>
> The code may be compiled as "gcc -o pf_dump -pthread pf_dump.c"
>
> An example invocation that quickly fails:
>
> root@24bf2e440b5e:/# strace -ff -o pfd ./pf_dump
> [3700] exit success
> glibc: check_pf: netlink socket read timeout
> Aborted (core dumped)
>
> Interestingly, this seems to be very easy to reproduce with pthreads, but much less
> common with fork()ed processes or threads created directly with clone(). I'm not sure
> whether this is just an artifact of how I am testing or an actual clue, but I figured
> I'd mention it.
>
> I have tested this program on vanilla kernels 4.0.4 and 4.2.1: the 4.0.4 kernel
> reliably reproduces the failure, but I am having trouble reproducing it on 4.2.1.
>
> So usually I would upgrade to 4.2.1 and be happy, except we ran into serious problems
> with 4.1.2 and are now a little shy about upgrading:
>
> https://bugzilla.xamarin.com/show_bug.cgi?id=29212
>
> So my questions from here are:
>
> * Is this glibc code correct?
> * What are the situations where a recvmsg from a netlink socket can hang as it does here?
> * Is the potential "fix" in 4.2.1 due to any particular commit? I checked the changelogs and nothing caught my eye.
>
> We'll be testing out 4.2.1 more thoroughly over the coming days but I am hoping someone here
> can shed some light on our problem.
>
http://comments.gmane.org/gmane.linux.network/363085

might explain your problem.

I thought this was resolved in 4.1, but it looks like the problem still persists
there. At least I have reports from my workplace that 4.1.6 and 4.1.7 are still
affected. I don't know if there have been any relevant changes in 4.2.

Copying Herbert and Eric for additional input.

Guenter
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/