Re: Threads stuck in zap_pid_ns_processes()

From: Vovo Yang
Date: Fri May 12 2017 - 05:30:51 EST


On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>
>> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>>
>>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>>> >> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > the test program attached below almost always results in one of the child
>>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>>> >> > a single thread left in the pid namespace (the one that is stuck).
>>> >> > Traceback from /proc/<pid>/stack is
>>> >> >
>>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> >
>>> >> > After 120 seconds, I get the "hung task" message.
>>> >> >
>>> >> > Example from v4.11:
>>> >> >
>>> >> > ...
>>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>>> >> > [ 3263.379561] Not tainted 4.11.0+ #1
>>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> >> > [ 3263.379577] clone D 0 27910 27909 0x00000000
>>> >> > [ 3263.379587] Call Trace:
>>> >> > [ 3263.379608] __schedule+0x677/0xda0
>>> >> > [ 3263.379621] ? pci_mmcfg_check_reserved+0xc0/0xc0
>>> >> > [ 3263.379634] ? task_stopped_code+0x70/0x70
>>> >> > [ 3263.379643] schedule+0x4d/0xd0
>>> >> > [ 3263.379653] zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [ 3263.379659] ? copy_pid_ns+0x4d0/0x4d0
>>> >> > [ 3263.379670] do_exit+0x10d4/0x1330
>>> >> > ...
>>> >> >
>>> >> > The problem is seen in all kernels up to v4.11.
>>> >> >
>>> >> > Any idea what might be going on and how to fix the problem ?
>>> >>
>>> >> Let me see. Reading the code it looks like we have three tasks
>>> >> let's call them main, child1, and child2.
>>> >>
>>> >> child1 and child2 are started using CLONE_THREAD and are
>>> >> thus clones of one another.
>>> >>
>>> >> child2 exits first but is ptraced by main so is not reaped.
>>> >> Further child2 calls do_group_exit forcing child1 to
>>> >> exit making for fun races.
>>> >>
>>> >> A ptread_exit() or syscall(SYS_exit, 0); would skip
>>> >> the group exit and make the window larger.
>>> >>
>>> >> child1 exits next and calls zap_pid_ns_processes and is
>>> >> waiting for child2 to be reaped by main.
>>> >>
>>> >> main is just sitting around doing nothing for 3600 seconds
>>> >> not reaping anyone.
>>> >>
>>> >> I would expect that when main exits everything would be cleaned up
>>> >> and the only real issue is that we have a hung task warning.
>>> >>
>>> >> Does everything cleanup when main exits?
>>> >>
>>> >
>>> > As an add-on to my previous mail: I added a function to count
>>> > the number of threads in the pid namespace, using next_pidmap().
>>> > Even though nr_hashed == 2, only the hanging thread is still
>>> > present.
>>>
>>> For your testcase? I suspect you copied the code from
>>> zap_pid_ns_processes and skipped pid 1. It is going to be pid 1 that is
>>> calling zap_pid_ns_processes.
>>>
>>
>> Almost. Something along the line of
>>
>> count = 0;
>> nr = next_pidmap(pid_ns, 0);
>> while (nr > 0) {
>> count++;
>> nr = next_pidmap(pid_ns, nr);
>> }
>>
>> only I also call sched_show_task() for each thread, and the only
>> one printed is the one that hangs in zap_pid_ns_processes().
>>
>>> > Is there maybe a better way to terminate the wait loop than
>>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>>>
>>> Not right now. nr_hashed counts the number of "struct pid"s that are
>>> in attached to processes and in the hash table for looking up tasks.
>>>
>>> Waiting until that drops to just the last task in the pid namespace
>>> really does do a good job of counting what we are looking for. It
>>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>>> nr_hashed will be 2. For various reasons.
>>>
>>> That said it isn't as obvious as it should be and I would like
>>> to improve that.
>>>
>>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>>> instead of TASK_UNITERRUPTIBLE. This really is not a short term
>>> disk sleep nor anything guaranteed to complete in any sort of
>>> finite time. Plus the hung task timeout warning only triggers
>>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>>> the warning and keep a long delay in zap_pid_ns_processes from
>>> bumping up your load average.
>>>
>>> That still leaves the question of what state strange state
>>> you are getting into but this should at least prevent the unexepcted
>>> reboots.
>>>
>>
>> What I know so far is
>> - We see this condition on a regular basis in the field. Regular is
>> relative, of course - let's say maybe 1 in a Milion Chromebooks
>> per day reports a crash because of it. That is not that many,
>> but it adds up.
>> - We are able to reproduce the problem with a performance benchmark
>> which opens 100 chrome tabs. While that is a lot, it should not
>> result in a kernel hang/crash.
>> - Vovo proviced the test code last night. I don't know if this is
>> exactly what is observed in the benchmark, or how it relates to the
>> benchmark in the first place, but it is the first time we are actually
>> able to reliably create a condition where the problem is seen.
>
> Thank you. I will be interesting to hear what is happening in the
> chrome perfomance benchmark that triggers this.
>

What's happening in the benchmark:
1. A chrome renderer process was created with CLONE_NEWPID
2. The process crashed
3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
threads of the crashed process to dump info
4. When breakpad detach the crashed process, the crashed process stuck in
zap_pid_ns_processes()

Thanks,
Vovo

> There may be some other steps that are worth taking.
>
>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>> index 77403c157157..971e7bc6939b 100644
>>> --- a/kernel/pid_namespace.c
>>> +++ b/kernel/pid_namespace.c
>>> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>> * if reparented.
>>> */
>>> for (;;) {
>>> - set_current_state(TASK_UNINTERRUPTIBLE);
>>> + set_current_state(TASK_INTERRUPTIBLE);
>>> if (pid_ns->nr_hashed == init_pids)
>>> break;
>>> schedule();
>>>
>>
>> I'll give it a try. Are there any potential unintended side effects ?
>
> No. The only reason I didn't implement this way in the first place is
> that I have cognitive disssonance between the name of that state and
> the fact the code isn't interruptible.
>
> In the long term it is probably better to refactor that code path so it
> doesn't exist in the first place, and use more common kernel idioms.
> But for right now it should not be a problem.
>
>
> Eric
>