Re: kernel BUG at kernel/sched_rt.c:322!

From: Paul E. McKenney
Date: Thu Oct 09 2008 - 22:13:32 EST


On Fri, Oct 10, 2008 at 09:54:11AM +0800, Zhang, Yanmin wrote:
>
> On Thu, 2008-10-09 at 05:31 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 09, 2008 at 07:06:38AM +0200, Peter Zijlstra wrote:
> > > On Wed, 2008-10-08 at 18:14 -0700, Paul E. McKenney wrote:
> > > > When I enable:
> > > >
> > > > CONFIG_GROUP_SCHED=y
> > > > CONFIG_FAIR_GROUP_SCHED=y
> > > > CONFIG_USER_SCHED=y
> > > >
> > > > and run a bash script onlining and offlining CPUs in an infinite loop
> > > > on x86 using 2.6.27-rc9, after about 1.5 hours I get the following.
> Paul,
>
> Wuld you like to share your scipt? I tested cpu hotplug on my 8-core machine by
> unplug cpu 2~5 and plug them in a loop for one night and didn't trigger the issue.

See attached! I hand-edit the loop for the machine at hand, so on an
8-CPU x86 machine I would use "for ((i = 1; i < 8; i++))", given that
x86 machines tend not to allow you to offline CPU 0.

> Did you set CONFIG_RT_GROUP_SCHED=y?

No, I did not.

Thanx, Paul

> > > > On the off-chance that this is new news...
> > >
> > > Hmm, yes. I thought I had all those fixed :-(
> >
> > I know that feeling!!! ;-)
> >
> > > > [ 5538.091011] kernel BUG at kernel/sched_rt.c:322!
> > > > [ 5538.091011] invalid opcode: 0000 [#1] SMP
> > > > [ 5538.091011] Modules linked in:
> > > > [ 5538.091011]
> > > > [ 5538.091011] Pid: 2819, comm: sh Not tainted (2.6.27-rc9-autokern1 #1)
> > > > [ 5538.091011] EIP: 0060:[<c011c287>] EFLAGS: 00010002 CPU: 7
> > > > [ 5538.091011] EIP is at __disable_runtime+0x1c7/0x1d0
> > > > [ 5538.091011] EAX: c9056eec EBX: 00000001 ECX: 00000008 EDX: 00006060
> > > > [ 5538.091011] ESI: 02faf080 EDI: 00000000 EBP: f6df7cd0 ESP: f6df7ca8
> > > > [ 5538.091011] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > > > [ 5538.091011] Process sh (pid: 2819, ti=f6df6000 task=f6cbdc00 task.ti=f6df6000)
> > > > [ 5538.091011] Stack: f68c8004 c9056eec f68c8000 c9056b98 00000008 5d353631 c04d0020 c9056b00
> > > > [ 5538.091011] c9056b00 c9056b00 f6df7cdc c011d151 c037dfc0 f6df7cec c011aedb f68c8000
> > > > [ 5538.091011] c04d2200 f6df7d04 c011f967 00000282 00000000 00000000 00000000 f6df7e48
> > > > [ 5538.091011] Call Trace:
> > > > [ 5538.091011] [<c011d151>] ? rq_offline_rt+0x21/0x60
> > > > [ 5538.091011] [<c011aedb>] ? set_rq_offline+0x2b/0x50
> > > > [ 5538.091011] [<c011f967>] ? rq_attach_root+0xa7/0xb0
> > > > [ 5538.091011] [<c0120bbf>] ? cpu_attach_domain+0x30f/0x490
> > >
> > > At the very least we're doing part of the offline process twice it
> > > seems, once through set_rq_offline()/set_rq_online() and once through
> > > disable_runtime()/enabled_runtime().
> > >
> > > But seeing as we set an offlined cpu's runtime to RUNTIME_INF and skip
> > > cpus with RUNTIME_INF runtime that should be harmless.
> >
> > Would double-processing a non-offlined CPU cause trouble, perhaps
> > setting the runtime to a nonsensical value?
> >
> > > Modifications to rt_rq->rt_runtime are all done while holding
> > > rt_b->rt_runtime_lock and rt_rq->rt_runtime_lock (do_balance_runtime()
> > > and __disable_runtime() and __enable_runtime()). Which means its enough
> > > to hold either of those locks in order to get a stable reading of the
> > > value.
> These locks, especially ïrt_b->rt_runtime_lock, prevent the simultaneous
> changing of ïrt_runtime. It looks codes are ok.
>
> Anything related to RCU?
>
> > >
> > > Which leaves me puzzled for the moment...
> >
> > I know that feeling as well...
> >
> > > tip/master has the following commit to clarify the code somewhat:
> > >
> > >
> > > commit 78333cdd0e472180743d35988e576d6ecc6f6ddb
> > > Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> > > Date: Tue Sep 23 15:33:43 2008 +0200
> > >
> > > sched: add some comments to the bandwidth code
> > >
> > > Hopefully clarify some of this code a little.
> > >
> > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> > > Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
> > >
> > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > > index 2e228bd..d570a8c 100644
> > > --- a/kernel/sched_rt.c
> > > +++ b/kernel/sched_rt.c
> > > @@ -231,6 +231,9 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
> > > #endif /* CONFIG_RT_GROUP_SCHED */
> > >
> > > #ifdef CONFIG_SMP
> > > +/*
> > > + * We ran out of runtime, see if we can borrow some from our neighbours.
> > > + */
> >
> > Suppose that all CPUs nearby have run out of runtime. Or is that
> > possible?
> >
> > Thanx, Paul
> >
> > > static int do_balance_runtime(struct rt_rq *rt_rq)
> > > {
> > > struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> > > @@ -250,9 +253,18 @@ static int do_balance_runtime(struct rt_rq *rt_rq)
> > > continue;
> > >
> > > spin_lock(&iter->rt_runtime_lock);
> > > + /*
> > > + * Either all rqs have inf runtime and there's nothing to steal
> > > + * or __disable_runtime() below sets a specific rq to inf to
> > > + * indicate its been disabled and disalow stealing.
> > > + */
> > > if (iter->rt_runtime == RUNTIME_INF)
> > > goto next;
> > >
> > > + /*
> > > + * From runqueues with spare time, take 1/n part of their
> > > + * spare time, but no more than our period.
> > > + */
> > > diff = iter->rt_runtime - iter->rt_time;
> > > if (diff > 0) {
> > > diff = div_u64((u64)diff, weight);
> > > @@ -274,6 +286,9 @@ next:
> > > return more;
> > > }
> > >
> > > +/*
> > > + * Ensure this RQ takes back all the runtime it lend to its neighbours.
> > > + */
> > > static void __disable_runtime(struct rq *rq)
> > > {
> > > struct root_domain *rd = rq->rd;
> > > @@ -289,17 +304,33 @@ static void __disable_runtime(struct rq *rq)
> > >
> > > spin_lock(&rt_b->rt_runtime_lock);
> > > spin_lock(&rt_rq->rt_runtime_lock);
> > > + /*
> > > + * Either we're all inf and nobody needs to borrow, or we're
> > > + * already disabled and thus have nothing to do, or we have
> > > + * exactly the right amount of runtime to take out.
> > > + */
> > > if (rt_rq->rt_runtime == RUNTIME_INF ||
> > > rt_rq->rt_runtime == rt_b->rt_runtime)
> > > goto balanced;
> > > spin_unlock(&rt_rq->rt_runtime_lock);
> > >
> > > + /*
> > > + * Calculate the difference between what we started out with
> > > + * and what we current have, that's the amount of runtime
> > > + * we lend and now have to reclaim.
> > > + */
> > > want = rt_b->rt_runtime - rt_rq->rt_runtime;
> > >
> > > + /*
> > > + * Greedy reclaim, take back as much as we can.
> > > + */
> > > for_each_cpu_mask(i, rd->span) {
> > > struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
> > > s64 diff;
> > >
> > > + /*
> > > + * Can't reclaim from ourselves or disabled runqueues.
> > > + */
> > > if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
> > > continue;
> > >
> > > @@ -319,8 +350,16 @@ static void __disable_runtime(struct rq *rq)
> > > }
> > >
> > > spin_lock(&rt_rq->rt_runtime_lock);
> > > + /*
> > > + * We cannot be left wanting - that would mean some runtime
> > > + * leaked out of the system.
> > > + */
> > > BUG_ON(want);
> > > balanced:
> > > + /*
> > > + * Disable all the borrow logic by pretending we have inf
> > > + * runtime - in which case borrowing doesn't make sense.
> > > + */
> > > rt_rq->rt_runtime = RUNTIME_INF;
> > > spin_unlock(&rt_rq->rt_runtime_lock);
> > > spin_unlock(&rt_b->rt_runtime_lock);
> > > @@ -343,6 +382,9 @@ static void __enable_runtime(struct rq *rq)
> > > if (unlikely(!scheduler_running))
> > > return;
> > >
> > > + /*
> > > + * Reset each runqueue's bandwidth settings
> > > + */
> > > for_each_leaf_rt_rq(rt_rq, rq) {
> > > struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
>
>

Attachment: onofftorture128.sh
Description: onofftorture128.sh