Re: [BUG] [ tip/sched/core ] System unresponsive after booting

From: Daniel Lezcano
Date: Thu Jan 16 2014 - 08:49:02 EST


On 01/15/2014 01:04 PM, Peter Zijlstra wrote:
On Wed, Jan 15, 2014 at 09:27:34AM +0100, Daniel Lezcano wrote:

Hi all,

I use the tip/sched/core branch.

After git pulling yesterday, my host is unresponsive after booting the OS.

* It boots normally
* It sends info to the console
* The graphics does not work
* The terminals show the prompt, I can enter the username but after
pressing enter, it does not give the password prompt
* sysrq works more or less, I can't get the process stack but it receives
the command

It is like no new process can be created.

I have a dual Xeon processor E5325 (2 x 4 cores).

After git bisecting, the following patch seems to introduce the bug.

commit d50dde5a10f305253cbc3855307f608f8a3c5f73

OK, so my headless WSM-EP boots just fine. Obviously it cannot confirm
if graphics works, but I can ssh in and work on it without bother.

I can even log in on the serial console without problems.

I tried both tip/master and tip/sched/core.

Would you happen to have a .config for me to try?

I was able to reduce the scope and reproduce the issue.

AFAICT, it happens with rsyslogd. When logging in on a tty, the login command sends a message through /dev/log, but rsyslogd is never woken up and stays blocked in poll_schedule_timeout. The login process is in turn blocked in unix_wait_for_peer.
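One way to check from userspace whether a write to the log socket would block is to send a single datagram with MSG_DONTWAIT. The sketch below assumes the socket is of type SOCK_DGRAM, which is common for /dev/log but is not stated in this thread; the helper name probe_unix_dgram is made up for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Send one datagram to the AF_UNIX socket at `path` without blocking.
 * Returns 0 if the message was queued, otherwise the errno value.
 * EAGAIN means a blocking sender would have gone to sleep at this point.
 * (Hypothetical helper; assumes a SOCK_DGRAM receiver.)
 */
int probe_unix_dgram(const char *path, const char *msg)
{
	struct sockaddr_un addr;
	int fd, err = 0;

	if (strlen(path) >= sizeof(addr.sun_path))
		return ENAMETOOLONG;

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return errno;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		err = errno;
	else if (send(fd, msg, strlen(msg), MSG_DONTWAIT) < 0)
		err = errno;

	close(fd);
	return err;
}
```

Calling probe_unix_dgram("/dev/log", "<14>probe") on the affected machine and getting EAGAIN back would be consistent with a receiver that no longer drains its queue; a return of 0 would point elsewhere.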

I can strace rsyslogd at startup. The last two sched_setscheduler calls fail with EPERM.

> grep sched trace.out

3570 sched_getparam(3570, { 0 }) = 0
3570 sched_getscheduler(3570) = 0 (SCHED_OTHER)
3570 sched_get_priority_min(SCHED_OTHER) = 0
3570 sched_get_priority_max(SCHED_OTHER) = 0
3571 sched_get_priority_min(SCHED_OTHER) = 0
3571 sched_get_priority_max(SCHED_OTHER) = 0
3571 sched_get_priority_min(SCHED_OTHER) = 0
3571 sched_get_priority_max(SCHED_OTHER) = 0
3571 sched_setscheduler(3572, SCHED_OTHER, { 0 } <unfinished ...>
3571 <... sched_setscheduler resumed> ) = 0
3571 sched_get_priority_min(SCHED_OTHER <unfinished ...>
3571 <... sched_get_priority_min resumed> ) = 0
3571 sched_get_priority_max(SCHED_OTHER <unfinished ...>
3571 <... sched_get_priority_max resumed> ) = 0
3571 sched_setscheduler(3573, SCHED_OTHER, { 0 } <unfinished ...>
3571 <... sched_setscheduler resumed> ) = -1 EPERM (Operation not permitted)
3571 sched_get_priority_min(SCHED_OTHER <unfinished ...>
3571 <... sched_get_priority_min resumed> ) = 0
3571 sched_get_priority_max(SCHED_OTHER <unfinished ...>
3571 <... sched_get_priority_max resumed> ) = 0
3571 sched_setscheduler(3574, SCHED_OTHER, { 0 } <unfinished ...>
3571 <... sched_setscheduler resumed> ) = -1 EPERM (Operation not permitted)

Here is the same strace on a kernel which does not hang. None of the calls to sched_setscheduler fail.

3292 sched_getparam(3292, { 0 }) = 0
3292 sched_getscheduler(3292) = 0 (SCHED_OTHER)
3292 sched_get_priority_min(SCHED_OTHER) = 0
3292 sched_get_priority_max(SCHED_OTHER) = 0
3293 sched_get_priority_min(SCHED_OTHER) = 0
3293 sched_get_priority_max(SCHED_OTHER) = 0
3293 sched_get_priority_min(SCHED_OTHER) = 0
3293 sched_get_priority_max(SCHED_OTHER) = 0
3293 sched_setscheduler(3294, SCHED_OTHER, { 0 } <unfinished ...>
3293 <... sched_setscheduler resumed> ) = 0
3293 sched_get_priority_min(SCHED_OTHER <unfinished ...>
3293 <... sched_get_priority_min resumed> ) = 0
3293 sched_get_priority_max(SCHED_OTHER <unfinished ...>
3293 <... sched_get_priority_max resumed> ) = 0
3293 sched_setscheduler(3295, SCHED_OTHER, { 0 } <unfinished ...>
3293 <... sched_setscheduler resumed> ) = 0
3293 sched_get_priority_min(SCHED_OTHER <unfinished ...>
3293 <... sched_get_priority_min resumed> ) = 0
3293 sched_get_priority_max(SCHED_OTHER <unfinished ...>
3293 <... sched_get_priority_max resumed> ) = 0
3293 sched_setscheduler(3296, SCHED_OTHER, { 0 } <unfinished ...>
3293 <... sched_setscheduler resumed> ) = 0
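
The sched_setscheduler calls in both traces can be replayed outside rsyslogd with a few lines of C. This is only a sketch: the function name set_sched_other is invented here, and it targets whatever pid the caller passes rather than the rsyslogd threads seen in the traces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Ask the kernel to apply SCHED_OTHER with static priority 0 to `pid`
 * (0 means the calling thread), as the traced process does.
 * Returns 0 on success or the errno value, e.g. EPERM as in the failing trace.
 * (Hypothetical helper name.)
 */
int set_sched_other(pid_t pid)
{
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(pid, SCHED_OTHER, &param) == -1)
		return errno;
	return 0;
}
```

Running set_sched_other(0) from an unprivileged shell on both kernels and comparing the return values would show whether the failure depends on rsyslogd at all.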

The EPERM error comes from kernel/sched/core.c:3303

	...
	if (fair_policy(policy)) {
		if (!can_nice(p, attr->sched_nice))
			return -EPERM;
	}
	...
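
To see when that branch can return -EPERM, here is a small userspace model of the test. It assumes can_nice() compares 20 - nice against the task's RLIMIT_NICE and otherwise falls back to CAP_SYS_NICE, which is how I remember the kernel helper from this period; the function and parameter names below are invented for the model and are not kernel symbols.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the nice permission test (assumed behaviour).
 * `rlimit_nice` stands in for the task's RLIMIT_NICE soft limit and
 * `has_cap_sys_nice` for capable(CAP_SYS_NICE).
 */
bool model_can_nice(int nice, long rlimit_nice, bool has_cap_sys_nice)
{
	/* Convert a nice value in [19, -20] to an rlimit-style value in [1, 40]. */
	long nice_rlim = 20 - nice;

	return nice_rlim <= rlimit_nice || has_cap_sys_nice;
}

/* Mirrors the quoted branch: 0 if allowed, EPERM if the test fails. */
int model_fair_policy_check(int nice, long rlimit_nice, bool has_cap_sys_nice)
{
	if (!model_can_nice(nice, rlimit_nice, has_cap_sys_nice))
		return EPERM;
	return 0;
}
```

Under this model a task with RLIMIT_NICE of 0 and no CAP_SYS_NICE is refused even for nice 0, since 20 - 0 is greater than 0. Whether that is what happens to rsyslogd here is not established.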


But I don't know why this leads to a blocked process, or why rsyslogd is not woken up by a packet arriving on the af_unix socket.

I hope that helps

-- Daniel


--
<http://www.linaro.org/> Linaro.org | Open source software for ARM SoCs
