Re: frequent lockups in 3.18rc4

From: Frederic Weisbecker
Date: Thu Nov 20 2014 - 10:08:10 EST


On Mon, Nov 17, 2014 at 12:03:59PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:
>
> > > > I'll try that next, and check in on it tomorrow.
> > >
> > > No luck. Died even faster this time.
> >
> > Yeah, and your other lockups haven't even been TLB related. Not that
> > they look like anything else *either*.
> >
> > I have no ideas left. I'd go for a bisection - rather than try random
> > things, at least bisection will get us a smaller set of suspects if
> > you can go through a few cycles of it. Even if you decide that you
> > want to run for most of a day before you are convinced it's all good,
> > a couple of days should get you a handful of bisection points (that's
> > assuming you hit a couple of bad ones too that turn bad in a shorter
> > > while). And four or five bisections should get us from 11k commits down
> > to the ~600 commit range. That would be a huge improvement.
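
For reference, the arithmetic behind that estimate can be sketched as below. This is only a rough model: it assumes each bisection round halves the remaining suspects exactly, which git bisect only approximates on a non-linear history. The function names are made up for illustration.

```python
import math


def remaining_after(total_commits: int, steps: int) -> int:
    """Upper bound on suspect commits left after `steps` bisection rounds,
    assuming every round halves the candidate range (idealized model)."""
    remaining = total_commits
    for _ in range(steps):
        remaining = math.ceil(remaining / 2)
    return remaining


def steps_to_single(total_commits: int) -> int:
    """Rounds needed to narrow the range to one commit under the same model."""
    steps = 0
    remaining = total_commits
    while remaining > 1:
        remaining = math.ceil(remaining / 2)
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (4, 5):
        print(n, "rounds ->", remaining_after(11_000, n), "commits left")
    print("rounds to isolate one commit:", steps_to_single(11_000))
```

Under that model, four rounds leave roughly 700 of the 11k commits and five rounds leave about half of that, so the "~600" figure sits between the two.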
>
> Great start to the week: I decided to confirm my recollection that .17
> was ok, only to hit this within 10 minutes.
>
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
> 0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
> ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
> ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
> Call Trace:
> <NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
> [<ffffffff9583bcc0>] panic+0xd4/0x207
> [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
> [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
> [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
> [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
> [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
> [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
> [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
> [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
> [<ffffffff950082a8>] do_nmi+0xb8/0x100
> [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> <<EOE>> <IRQ> [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
> [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0

Ah, that one got fixed in the merge window and in -stable, right?

> [<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
> [<ffffffff95113557>] tick_nohz_restart+0x17/0x90
> [<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
> [<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
> [<ffffffff95188894>] irq_work_run_list+0x44/0x70
> [<ffffffff951888ea>] irq_work_run+0x2a/0x50
> [<ffffffff9510109b>] update_process_times+0x5b/0x70
> [<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
> [<ffffffff95113801>] tick_sched_timer+0x41/0x60
> [<ffffffff95102281>] __run_hrtimer+0x81/0x480
> [<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
> [<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
> [<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
> [<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
> [<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
> <EOI> [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
> [<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
> [<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
> [<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
> [<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
> [<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
> [<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
> [<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
> [<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
> [<ffffffff9508f44e>] SyS_kill+0xe/0x10
> [<ffffffff95849b24>] tracesys+0xdd/0xe2
> Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
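
Side note for anyone matching these addresses against System.map or vmlinux: the kernel was relocated, so the offset printed above has to be subtracted first. A minimal sketch, assuming the offset applies uniformly to every text address in the trace (the helper name is hypothetical):

```python
# Values copied from the "Kernel Offset" line of the trace above.
KERNEL_OFFSET = 0x14000000
LINK_BASE = 0xFFFFFFFF81000000


def unrelocate(addr: int, offset: int = KERNEL_OFFSET) -> int:
    """Map a runtime (relocated) text address back to its link-time address."""
    return addr - offset


if __name__ == "__main__":
    # Frame from the trace: lock_hrtimer_base.isra.18+0x25/0x50
    runtime_addr = 0xFFFFFFFF95101685
    print(hex(unrelocate(runtime_addr)))  # 0xffffffff81101685
```

The resulting address is the one to look up in the unrelocated symbol table.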
>
> It could be a completely different cause for the lockup, but seeing this now
> has me wondering if perhaps it's something unrelated to the kernel.
> I have recollection of running late .17rc's for days without incident,
> and I'm pretty sure .17 was ok too. But a few weeks ago I did upgrade
> that test box to the Fedora 21 beta. Which means I have a new gcc.
> I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> get 4.8 back on there and see if that's any better.
>
> Dave
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/