Re: 2.6.32.21 - uptime related crashes?

From: john stultz
Date: Thu Jul 21 2011 - 15:54:18 EST


On Thu, 2011-07-21 at 09:22 +0200, Ingo Molnar wrote:
> * john stultz <johnstul@xxxxxxxxxx> wrote:
>
> > On Fri, 2011-07-15 at 12:01 +0200, Peter Zijlstra wrote:
> > > On Thu, 2011-07-14 at 17:35 -0700, john stultz wrote:
> > > >
> > > > Peter/Ingo: Can you take a look at the above and let me know if you find
> > > > it too disagreeable?
> > >
> > > +static unsigned long long __cycles_2_ns(unsigned long long cyc)
> > > +{
> > > +	unsigned long long ns = 0;
> > > +	struct x86_sched_clock_data *data;
> > > +	int cpu = smp_processor_id();
> > > +
> > > +	rcu_read_lock();
> > > +	data = rcu_dereference(per_cpu(cpu_sched_clock_data, cpu));
> > > +
> > > +	if (unlikely(!data))
> > > +		goto out;
> > > +
> > > +	ns = ((cyc - data->base_cycles) * data->mult) >> CYC2NS_SCALE_FACTOR;
> > > +	ns += data->accumulated_ns;
> > > +out:
> > > +	rcu_read_unlock();
> > > +	return ns;
> > > +}
> > >
> > > The way I read that we're still not wrapping properly if freq scaling
> > > 'never' happens.
> >
> > Right, this doesn't address the mult overflow behavior. As I mentioned
> > in the patch, the rework allows for solving that in the future using a
> > (possibly very rare) timer that would accumulate cycles to ns.
> >
> > This rework really just addresses the multiplication overflow ->
> > negative roll-under that currently occurs with the cyc2ns_offset value.
> >
> > > Because then we're wrapping on accumulated_ns + 2^54.
> > >
> > > Something like resetting base, and adding ns to accumulated_ns and
> > > returning the latter would make more sense.
> >
> > Although we'd have to update base_cycles and accumulated_ns
> > atomically, so it's probably not something to do in the sched_clock
> > path.
>
> Ping, what's going on with this bug? Systems are crashing so we need
> a quick fix ASAP ...

I think Peter's patch disabling sched_clock_stable is a good approach
for now.
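
For the longer-term rework quoted above: since base_cycles and
accumulated_ns have to move together, one way to do it is to never
update the struct in place and instead publish a fresh per-cpu snapshot
through the RCU pointer, which is what the rcu_dereference() in the
patch is set up for. Here's a minimal userspace sketch of that publish
pattern, with C11 atomics standing in for rcu_assign_pointer() /
rcu_dereference(), and names only mirroring the quoted patch -- this is
a hypothetical illustration, not the actual kernel code:

/* publish-demo.c: userspace-only sketch, not kernel code */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CYC2NS_SCALE_FACTOR 10

struct x86_sched_clock_data {
	uint64_t base_cycles;	 /* cycle count when snapshot was made */
	uint64_t accumulated_ns; /* ns folded in from prior snapshots */
	uint64_t mult;		 /* cycles -> ns multiplier, <<10 fixed point */
};

static _Atomic(struct x86_sched_clock_data *) clock_data;

/* Reader: one acquire load yields one internally consistent snapshot. */
static uint64_t cycles_2_ns(uint64_t cyc)
{
	struct x86_sched_clock_data *d =
		atomic_load_explicit(&clock_data, memory_order_acquire);

	return (((cyc - d->base_cycles) * d->mult) >> CYC2NS_SCALE_FACTOR)
		+ d->accumulated_ns;
}

/* Writer (the rare timer): fold the elapsed ns into accumulated_ns and
 * publish a fresh snapshot with one pointer store; readers see either
 * the old snapshot or the new one, never a mix. */
static void fold_and_publish(uint64_t now_cycles)
{
	struct x86_sched_clock_data *old =
		atomic_load_explicit(&clock_data, memory_order_relaxed);
	struct x86_sched_clock_data *new = malloc(sizeof(*new));

	new->mult = old->mult;
	new->base_cycles = now_cycles;
	new->accumulated_ns = old->accumulated_ns +
		(((now_cycles - old->base_cycles) * old->mult)
			>> CYC2NS_SCALE_FACTOR);
	atomic_store_explicit(&clock_data, new, memory_order_release);
	/* in the kernel, 'old' is freed only after an RCU grace period */
}

int main(void)
{
	struct x86_sched_clock_data *init = calloc(1, sizeof(*init));

	init->mult = 341;	/* assumed ~3GHz TSC, see below */
	atomic_store(&clock_data, init);

	printf("%llu ns\n", (unsigned long long)cycles_2_ns(3000000000ull));
	fold_and_publish(3000000000ull);
	printf("%llu ns\n", (unsigned long long)cycles_2_ns(6000000000ull));
	return 0;
}

The second printf reads the same clock across the fold (about 2s worth
of ns), which is the property we need: the fold can run from however
rare a timer we end up wanting without readers ever seeing a torn
(base_cycles, accumulated_ns) pair.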

Just to clarify a bit here: there was a related scheduler
division-by-zero issue, which to my understanding has already been
fixed post-2.6.32.21, but I have not actually seen any other crash logs
connected to the overflow.

Softlockup watchdog false positives have been posted (which I have also
reproduced), but I've seen no details on actual crashes, nor have I
been able to reproduce any using my forced-overflow patch.

This isn't to say that the overflow isn't causing crashes, only that
the reports have not made it clear that anything other than the
div-by-zero issue is crashing systems.
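
For anyone who wants to see the roll-over arithmetic concretely, here
is a quick userspace sketch. The mult value of 341 is just an assumed
(1000000 << 10) / cpu_khz for a ~3GHz part; the real per-cpu value
differs per machine:

/* overflow-demo.c: userspace-only sketch, not kernel code */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define CYC2NS_SCALE_FACTOR 10

static uint64_t cycles_2_ns(uint64_t cyc, uint64_t mult)
{
	/* 64x64->64 multiply: the high bits of the product are lost */
	return (cyc * mult) >> CYC2NS_SCALE_FACTOR;
}

int main(void)
{
	uint64_t mult = 341;			/* assumed ~3GHz TSC */
	uint64_t edge = UINT64_MAX / mult;	/* last cyc that won't wrap */

	printf("just before: %" PRIu64 " ns\n", cycles_2_ns(edge, mult));
	printf("just after:  %" PRIu64 " ns\n", cycles_2_ns(edge + 1, mult));
	return 0;
}

At 3GHz the edge works out to roughly 208 days of uptime, after which
the product wraps and the result jumps back to near zero -- which is
the wrap the rework above would let a (possibly very rare) timer avoid
by folding cycles into accumulated_ns before the edge is reached.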

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/