Re: frequent lockups in 3.18rc4

From: John Stultz
Date: Fri Jan 02 2015 - 19:27:25 EST


On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones <davej@xxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>> > One thing I think I'll try is to try and narrow down which
>> > syscalls are triggering those "Clocksource hpet had cycles off"
>> > messages. I'm still unclear on exactly what is doing
>> > the stomping on the hpet.
>>
>> First I ran trinity with "-g vm" which limits it to use just
>> a subset of syscalls, specifically VM related ones.
>> That triggered the messages. Further experiments revealed:
>
> So I can trigger the false positives with my original patch quite
> easily by just putting my box under some load. My numbers are nowhere
> near as bad as yours, but then, I didn't put it under as much load
> anyway. Just a regular "make -j64" of the kernel.
>
> I suspect your false positives are bigger partly because of the load,
> but mostly because you presumably have preemption enabled too. I don't
> do preemption in my normal kernels, and that limits the damage of the
> race a bit.
>
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
>
> The fact that the original racy patch seems to make a difference for
> you does say that yes, we seem to be zeroing in on the right area
> here, but I'm not seeing what's wrong. I was hoping for big jumps from
> your HPET, since your "TSC unstable" messages do kind of imply that
> such really big jumps can happen.
>
> I'm attaching my updated hacky patch, although I assume it's much too
> late for that machine. Don't look too closely at the backtrace
> generation part, that's just a quick hack, and only works with frame
> pointers enabled anyway.
>
> So I'm still a bit unhappy about not figuring out *what* is wrong. And
> I'd still like the dmidecode from that machine, just for posterity. In
> case we can figure out some pattern.
>
> So right now I can imagine several reasons:
>
> - actual hardware bug.
>
> This is *really* unlikely, though. It should hit everybody. The
> HPET is in the core intel chipset, we're not talking random unusual
> hardware by fly-by-night vendors here.
>
> - some SMM/BIOS "power management" feature.
>
> We've seen this before, where the SMM saves/restores the TSC on
> entry/exit in order to hide itself from the system. I could imagine
> similar code for the HPET counter. SMM writers use some bad drugs to
> dull their pain.
>
> And with the HPET counter, since it's not even per-CPU, the "save
> and restore HPET" will actually show up as "HPET went backwards" to
> the other non-SMM CPUs if it happens.
>
> - a bug in our own clocksource handling.
>
> I'm not seeing it. But maybe my patch hides it for some magical reason.

So I sent out a first-pass validation check to warn us when we end up
with idle periods that are larger than we expect.

It doesn't yet cap the timekeeping_get_ns() output (like your patch
effectively does), but it would be easy to do that in a follow-up
patch.
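
To make that concrete, here's a minimal sketch of what capping might
look like (this is not the code from either patch; the helper name and
the max_cycles limit are made up for illustration):

/*
 * Illustrative sketch only -- not the actual timekeeping code.
 * Clamp the raw cycle delta so a stale cycle_last (or a clocksource
 * read that jumped) can't be scaled into a huge bogus nanosecond
 * value.  'max_cycles' is a hypothetical per-clocksource limit.
 */
static inline u64 clamp_cycle_delta(u64 now, u64 last, u64 mask, u64 max_cycles)
{
	u64 delta = (now - last) & mask;

	if (unlikely(delta > max_cycles)) {
		WARN_ONCE(1, "clocksource delta %llu larger than expected max %llu\n",
			  (unsigned long long)delta,
			  (unsigned long long)max_cycles);
		delta = max_cycles;	/* cap rather than return garbage */
	}
	return delta;
}

The warning is really the important part for now -- capping just hides
the symptom, while the warning tells us how often (and by how much) the
delta goes wrong.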

I did notice while testing this that the max_idle_ns (the max idle
time we report to the scheduler) for the hpet is only ~16 seconds,
while we'll overflow after just ~21 seconds. That second number maps
closely to the 22-second stalls seen in the nmi watchdog reports,
which seems interesting, but I also realize that qemu uses a 100MHz
hpet, whereas real hardware is likely to be a bit slower, so maybe
that's just chance.
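
As a back-of-the-envelope illustration (my own numbers, and 14.318MHz
is just a typical real-hardware hpet frequency, not necessarily what
Dave's box has), here's how far a 32-bit hpet counter gets at different
clock rates:

/* Illustrative arithmetic only: time covered by 2^31 and 2^32 hpet
 * cycles at qemu's 100MHz vs a typical 14.318MHz real-hardware part. */
#include <stdio.h>

int main(void)
{
	double freqs[] = { 100000000.0, 14318180.0 };	/* qemu, typical hw */

	for (int i = 0; i < 2; i++) {
		double hz = freqs[i];

		printf("%8.3f MHz: 2^31 cycles = %6.1fs, 2^32 cycles = %6.1fs\n",
		       hz / 1e6, 2147483648.0 / hz, 4294967296.0 / hz);
	}
	return 0;
}

At 100MHz, 2^31 cycles is ~21.5 seconds (which lines up with the ~21
second figure above) and a full 32-bit wrap is ~42.9 seconds; at
14.318MHz the same points are roughly 150 and 300 seconds, so a slower
real-hardware hpet leaves a much wider window before anything
overflows.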

I'd be interested if folks who are seeing anything similar to what
Dave is seeing would give my patch a shot.

thanks
-john