Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes aredealocked when cpu is set to offline

From: Yi Yang
Date: Tue Mar 04 2008 - 04:43:16 EST


> This is the hung_task_timeout message after a couple of cpu-offlines.
>
> This is on 2.6.25-rc3.
>
> INFO: task bash:4467 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940
> 0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002
> 7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec
> Call Trace:
> [<c04e033a>] schedule_timeout+0x16/0x8b
> [<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
> [<c04e01c9>] wait_for_common+0xcf/0x12e
> [<c011a3f0>] ? default_wake_function+0x0/0xd
> [<c04e02aa>] wait_for_completion+0x12/0x14
> [<c012ccbb>] flush_cpu_workqueue+0x50/0x66
> [<c012cd28>] ? wq_barrier_func+0x0/0xd
> [<c012cd14>] cleanup_workqueue_thread+0x43/0x57
> [<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
> [<c04e3975>] notifier_call_chain+0x2b/0x4a
> [<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
> [<c0132eab>] raw_notifier_call_chain+0xc/0xe
> [<c013e054>] _cpu_down+0x150/0x1ec
> [<c013e133>] cpu_down+0x23/0x30
> [<c02e3897>] store_online+0x27/0x5a
> [<c02e3870>] ? store_online+0x0/0x5a
> [<c02e09d8>] sysdev_store+0x20/0x25
> [<c0196d2d>] sysfs_write_file+0xad/0xdf
> [<c0196c80>] ? sysfs_write_file+0x0/0xdf
> [<c0163da9>] vfs_write+0x8c/0x108
> [<c0164333>] sys_write+0x3b/0x60
> [<c01049da>] sysenter_past_esp+0x5f/0xa5
> =======================
> 3 locks held by bash/4467:
> #0: (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
> #1: (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
> #2: (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec
>
> So it's not just a not reaping of watchdog thread issue.
>
> I doubt it's due to some locking dependency since we have lockdep checks
> in the workqueue code before we flush the cpu_workqueue.
You may "echo 1 > /proc/sys/kernel/sysrq" and "echo t
> /proc/sysrq-trigger", then check dmesg info, you can get
[watchdog/#]'s call stack which could give out where it is currently.

On my machine, that indicated [watchdog/1] is calling
sched_setscheduler. I doubt it is being killed before it is started and
woken up, this may result in some synchronization issues.

>
> --
> Thanks and Regards
> gautham

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/