Re: [PATCH] workqueue: Don't spin forever in worker_maybe_bind_and_lock

From: Tejun Heo
Date: Wed May 04 2011 - 04:26:44 EST


On Wed, May 04, 2011 at 11:47:49AM +1000, Paul Mackerras wrote:
> On a 48-thread POWER7 box, I often see the system hang when offlining
> processors. What happens is that we get a rescuer thread trying to
> move to some processor at the same time that a cpu offline operation
> is happening for that processor, and we end up with one cpu spinning in
> worker_maybe_bind_and_lock() and all of the rest of the online cpus
> spinning inside the stop_machine code. The rescuer thread is
> continually calling set_cpus_allowed_ptr() which is continually
> failing because the cpu it is trying to move to is no longer in the
> cpu_active_mask. The result is a deadlock.
>
> This fixes worker_maybe_bind_and_lock so that it stops trying to move
> to a cpu if that cpu is no longer in the cpu_active_mask, and instead
> returns to its caller. With this I no longer see the deadlocks when
> offlining cpus.
>
> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx>

Hmm.. a fix for this problem has already been merged into mainline and
is scheduled for -stable. Can you please verify that the following
fixes the problem?

Thank you.
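
For readers following the thread, the bail-out idea described in the
quoted patch text can be sketched as a small userspace simulation. This
is neither Paul's patch nor the mainline fix mentioned above, and neither
of those is reproduced here. Every name below (cpumask_sim_t,
cpu_is_active, try_move_to_cpu, maybe_bind) is a hypothetical stand-in,
and the 64-bit mask only loosely models cpu_active_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_active_mask: bit n set => CPU n is active. */
typedef uint64_t cpumask_sim_t;

static bool cpu_is_active(cpumask_sim_t active, unsigned int cpu)
{
	assert(cpu < 64);
	return (active >> cpu) & 1u;
}

/*
 * Hypothetical stand-in for set_cpus_allowed_ptr(): the move fails when
 * the target CPU is not in the active mask, as described in the report.
 */
static int try_move_to_cpu(cpumask_sim_t active, unsigned int cpu,
			   unsigned int *cur_cpu)
{
	if (!cpu_is_active(active, cpu))
		return -1;
	*cur_cpu = cpu;
	return 0;
}

/*
 * Sketch of the proposed behaviour: check the active mask before each
 * attempt and return to the caller instead of retrying forever.
 * Returns true if bound to @cpu, false if the worker gave up.
 */
static bool maybe_bind(cpumask_sim_t active, unsigned int cpu,
		       unsigned int *cur_cpu, unsigned int *attempts)
{
	*attempts = 0;
	for (;;) {
		if (!cpu_is_active(active, cpu))
			return false;	/* CPU is going away: stop trying */
		(*attempts)++;
		if (try_move_to_cpu(active, cpu, cur_cpu) == 0)
			return true;
	}
}
```

With CPUs 0 and 2 active (mask 0x5), maybe_bind() for CPU 2 succeeds on
the first attempt, while maybe_bind() for CPU 1 returns false without
attempting a move. The simulation is single-threaded, so it does not
reproduce the race against stop_machine itself; it only shows the exit
condition.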