Re: [BUG] hotplug cpus on ia64

From: Cliff Wickman
Date: Tue Jun 03 2008 - 18:19:50 EST



On Fri, May 30, 2008 at 03:36:54PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-05-29 at 11:32 -0500, Cliff Wickman wrote:
> > >> I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
> > >> and get a very predictable hotplug cpu problem.
> > >> billberry1:/tmp/cpw # ./dis
> > >> disabled cpu 17
> > >> enabled cpu 17
> > >> billberry1:/tmp/cpw # ./dis
> > >> disabled cpu 17
> > >> enabled cpu 17
> > >> billberry1:/tmp/cpw # ./dis
> > >>
> > >> The script that disables the cpu always hangs (unkillable)
> > >> on the 3rd attempt.
> >
> > > And a bit further:
> > > The kstopmachine thread always sits on the run queue (real time) for about
> > > 30 minutes before running.
> >
> > And a bit further:
> >
> > The kstopmachine thread is queued as real-time on the downed cpu:
> > >> rq -f 17
> > CPU#  runq address        size  Lock  current task        time  name
> > ==========================================================================
> > 17    0xe000046003059540  3     U     0xe0000360f06f8000  0     swapper
> > Total of 3 queued:
> >   3 real time tasks: px *(rt_rq *)0xe000046003059608
> >   exclusive queue:
> >     slot 0
> >       0xe0000760f4628000  0  migration/17
> >       0xe0000760f4708000  0  kstopmachine
> >       0xe0000760f6678000  0  watchdog/17
> >
> > I put in counters and see that schedule() is never again entered by cpu 17
> > after it is downed the 3rd time.
> > (it is entered after being up'd the first two times)
> >
> > The kstopmachine thread is bound to cpu 17 by __stop_machine_run()'s call
> > to kthread_bind().
> >
> > A cpu does not schedule after being downed, of course. But it does again
> > after being up'd.
> > Why would the second up be different? Following it, if the cpu is
> > downed it never schedules again.
> >
> > If I always bind kstopmachine to cpu 0 the problem disappears.
>
> does:
>
> echo -1 > /proc/sys/kernel/sched_rt_runtime_us
>
> fix the problem?

Yes! It does.
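
For anyone who wants to try the same workaround and put the old setting
back afterwards, something along these lines should do. This is only a
sketch: the function names and the RT_RUNTIME_FILE override are made up
for illustration and are not part of any existing tool. Writing -1 turns
off the real-time runtime limit, as in the command quoted above.

```shell
# Sketch only: helper names and RT_RUNTIME_FILE are hypothetical.
# RT_RUNTIME_FILE can be pointed at a scratch file for a dry run.
RT_RUNTIME_FILE="${RT_RUNTIME_FILE:-/proc/sys/kernel/sched_rt_runtime_us}"

rt_throttle_off() {
    # Read the current limit, write -1 (no limit), and print the old
    # value so the caller can keep it for a later restore.
    old=$(cat "$RT_RUNTIME_FILE") || return 1
    echo -1 > "$RT_RUNTIME_FILE" || return 1
    echo "$old"
}

rt_throttle_restore() {
    # $1: the value previously printed by rt_throttle_off
    echo "$1" > "$RT_RUNTIME_FILE"
}
```

Usage (as root) would be roughly: saved=$(rt_throttle_off), run the
hotplug test, then rt_throttle_restore "$saved".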

Dimitri Sivanich has run into what looks like a similar problem.
Hope the above workaround is a good clue to its solution.
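
The ./dis script from the quoted report is not included in this thread.
A minimal sketch of what such a script might look like, assuming it
simply toggles the cpu through the sysfs "online" file; the function
names and the CPU_SYSFS override are hypothetical:

```shell
# Sketch only: not the actual ./dis script; names here are hypothetical.
# CPU_SYSFS can be pointed at a scratch directory for a dry run.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

cpu_set_online() {
    # $1: cpu number, $2: 0 to take the cpu down, 1 to bring it up
    echo "$2" > "$CPU_SYSFS/cpu$1/online"
}

cpu_cycle() {
    # $1: cpu number. Down the cpu, then up it again, reporting each step.
    cpu_set_online "$1" 0 || return 1
    echo "disabled cpu $1"
    cpu_set_online "$1" 1 || return 1
    echo "enabled cpu $1"
}
```

Calling cpu_cycle 17 three times in a row (as root) would correspond to
the three ./dis runs in the report.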

--
Cliff Wickman
Silicon Graphics, Inc.
cpw@xxxxxxx
(651) 683-3824