Re: Threads stuck in zap_pid_ns_processes()

From: Eric W. Biederman
Date: Thu May 11 2017 - 16:29:52 EST


Guenter Roeck <linux@xxxxxxxxxxxx> writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc/<pid>/stack is
>> >
>> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561] Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [ 3263.379577] clone D 0 27910 27909 0x00000000
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608] __schedule+0x677/0xda0
>> > [ 3263.379621] ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634] ? task_stopped_code+0x70/0x70
>> > [ 3263.379643] schedule+0x4d/0xd0
>> > [ 3263.379653] zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659] ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670] do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem ?
>>
>> Let me see. Reading the code it looks like we have three tasks
>> let's call them main, child1, and child2.
>>
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>>
>> child2 exits first but is ptraced by main so is not reaped.
>> Further child2 calls do_group_exit forcing child1 to
>> exit making for fun races.
>>
>> A ptread_exit() or syscall(SYS_exit, 0); would skip
>> the group exit and make the window larger.
>>
>> child1 exits next and calls zap_pid_ns_processes and is
>> waiting for child2 to be reaped by main.
>>
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>>
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>>
>> Does everything cleanup when main exits?
>>
> Yes, it does.
>
> Problem is that if the main task doesn't exit, it hangs forever.
> Chrome OS (where we see the problem in the field, and the application
> is chrome) is configured to reboot on hung tasks - if a task is hung
> for 120 seconds on those systems, it tends to be in a bad shape.
> This makes it a quite severe problem for us.

We definitely need to change the code in such a way as to not
trigger the hung task warning.

> I browsed through a number of stack traces - the ones I looked at
> all have the same traceback from do_group_exit(), though the
> caller of do_group_exit() is not always the same (that may depend
> on the kernel version, though). Sometimes it is __wake_up_parent(),
> sometimes it is get_signal_to_deliver(), sometimes it is
> get_signal().
>
> Are there other conditions besides ptrace where a task isn't reaped ?

The problem here is that the parent was outside of the pid namespace
and was choosing not to reap the child. The main task could easily
have reaped the child when it received SIGCHLD.

Arranging for forked children to run in a pid namespace with setns
can also create processes with parents outside the pid namespace,
that may or may not be reaped.

I suspect it is a bug to allow PTRACE_TRACEME if either the parent
or the child has called exec. That is not enough to block your test
case but I think it would block real world problems.

For getting to the root cause of this I would see if I could get
a full process list. It would be interesting to know what
the unreaped zombie is.

In the meantime I am going to see what it takes to change the code
so it doesn't trigger the hung task warning. I don't want to change
the actual observed behavior of the code but I can avoid triggering
a warning that does not apply.

Eric


>
> Thanks,
> Guenter
>
>> Eric
>>
>>
>> >
>> > Thanks,
>> > Guenter
>> >
>> > ---
>> > This test program was kindly provided by Vovo Yang <vovoy@xxxxxxxxxx>.
>> >
>> > Note that the ptrace() call in child1() is not necessary for the problem
>> > to be seen, though it seems to make it a bit more likely.
>>
>> That would appear to just slow things down a smidge. As there is
>> nothing substantial that happens ptrace wise except until after
>> zap_pid_ns_processes.
>>
>>
>> > ---
>> >
>> > #define _GNU_SOURCE
>> > #include <stdio.h>
>> > #include <stdlib.h>
>> > #include <unistd.h>
>> > #include <sys/ptrace.h>
>> > #include <errno.h>
>> > #include <string.h>
>> > #include <sched.h>
>> >
>> > #define STACK_SIZE 65536
>> >
>> > int child1(void* arg);
>> > int child2(void* arg);
>> >
>> > int main(int argc, char **argv)
>> > {
>> > int child_pid;
>> > char* child_stack = malloc(STACK_SIZE);
>> > char* stack_top = child_stack + STACK_SIZE;
>> > char command[256];
>> >
>> > child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
>> > if (child_pid == -1) {
>> > printf("parent: clone failed: %s\n", strerror(errno));
>> > return EXIT_FAILURE;
>> > }
>> > printf("parent: child1_pid: %d\n", child_pid);
>> >
>> > sleep(2);
>> > printf("child state, if it's D (disk sleep), the child process is hung\n");
>> > sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>> > system(command);
>> > sleep(3600);
>> > return EXIT_SUCCESS;
>> > }
>> >
>> > int child1(void* arg)
>> > {
>> > int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
>> > char* child_stack = malloc(STACK_SIZE);
>> > char* stack_top = child_stack + STACK_SIZE;
>> > long ret;
>> >
>> > ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>> > if (ret == -1) {
>> > printf("child1: ptrace failed: %s\n", strerror(errno));
>> > return EXIT_FAILURE;
>> > }
>> >
>> > ret = clone(&child2, stack_top, flags, NULL);
>> > if (ret == -1) {
>> > printf("child1: clone failed: %s\n", strerror(errno));
>> > return EXIT_FAILURE;
>> > }
>> > printf("child1: child2 pid: %ld\n", ret);
>> >
>> > sleep(1);
>> > printf("child1: end\n");
>> > return EXIT_SUCCESS;
>> > }
>> >
>> > int child2(void* arg)
>> > {
>> > long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>> > if (ret == -1) {
>> > printf("child2: ptrace failed: %s\n", strerror(errno));
>> > return EXIT_FAILURE;
>> > }
>> >
>> > printf("child2: end\n");
>> > return EXIT_SUCCESS;
>> > }