Re: [ANNOUNCE] 3.12.6-rt9

From: Sebastian Andrzej Siewior
Date: Fri Jan 17 2014 - 12:01:04 EST


* Mike Galbraith | 2013-12-24 16:47:47 [+0100]:

>I built this kernel with Paul's patch and NO_HZ_FULL enabled again on 64
>core box. I haven't seen RCU grip yet, but I just checked on it after
>3.5 hours into this boot/beat (after fixing crash+kdump setup), and
>found it in the process of dumping.

So you also have the timers-do-not-raise-softirq-unconditionally.patch?

>crash> bt
>PID: 508 TASK: ffff8802739ba340 CPU: 16 COMMAND: "ksoftirqd/16"
> #0 [ffff880276806a40] machine_kexec at ffffffff8103bc07
> #1 [ffff880276806aa0] crash_kexec at ffffffff810d56b3
> #2 [ffff880276806b70] panic at ffffffff815bf8b0
> #3 [ffff880276806bf0] watchdog_overflow_callback at ffffffff810fed3d
> #4 [ffff880276806c10] __perf_event_overflow at ffffffff81131928
> #5 [ffff880276806ca0] perf_event_overflow at ffffffff81132254
> #6 [ffff880276806cb0] intel_pmu_handle_irq at ffffffff8102078f
> #7 [ffff880276806de0] perf_event_nmi_handler at ffffffff815c5825
> #8 [ffff880276806e10] nmi_handle at ffffffff815c4ed3
> #9 [ffff880276806ea0] default_do_nmi at ffffffff815c5063
>#10 [ffff880276806ed0] do_nmi at ffffffff815c5388
>#11 [ffff880276806ef0] end_repeat_nmi at ffffffff815c4371
> [exception RIP: _raw_spin_trylock+48]
> RIP: ffffffff815c3790 RSP: ffff880276803e28 RFLAGS: 00000002
> RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000002
> RDX: ffff880276803e28 RSI: 0000000000000018 RDI: 0000000000000001
> RBP: ffffffff815c3790 R8: ffffffff815c3790 R9: 0000000000000018
> R10: ffff880276803e28 R11: 0000000000000002 R12: ffffffffffffffff
> R13: ffff880273a0c000 R14: ffff8802739ba340 R15: ffff880273a03fd8
> ORIG_RAX: ffff880273a03fd8 CS: 0010 SS: 0018
>--- <RT exception stack> ---
>#12 [ffff880276803e28] _raw_spin_trylock at ffffffff815c3790
>#13 [ffff880276803e30] rt_spin_lock_slowunlock_hirq at ffffffff815c2cc8
>#14 [ffff880276803e50] rt_spin_unlock_after_trylock_in_irq at ffffffff815c3425
>#15 [ffff880276803e60] get_next_timer_interrupt at ffffffff810684a7
>#16 [ffff880276803ed0] tick_nohz_stop_sched_tick at ffffffff810c5f2e
>#17 [ffff880276803f50] tick_nohz_irq_exit at ffffffff810c6333
>#18 [ffff880276803f70] irq_exit at ffffffff81060065
>#19 [ffff880276803f90] smp_apic_timer_interrupt at ffffffff810358f5
>#20 [ffff880276803fb0] apic_timer_interrupt at ffffffff815cbf9d
>--- <IRQ stack> ---
>#21 [ffff880273a03b28] apic_timer_interrupt at ffffffff815cbf9d
> [exception RIP: _raw_spin_lock+50]
> RIP: ffffffff815c3642 RSP: ffff880273a03bd8 RFLAGS: 00000202
> RAX: 0000000000008b49 RBX: ffff880272157290 RCX: ffff8802739ba340
> RDX: 0000000000008b4a RSI: 0000000000000010 RDI: ffff880273a0c000
> RBP: ffff880273a03bd8 R8: 0000000000000001 R9: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff810927b5
> R13: ffff880273a03b68 R14: 0000000000000010 R15: 0000000000000010
> ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
>#22 [ffff880273a03be0] rt_spin_lock_slowlock at ffffffff815c2591
>#23 [ffff880273a03cc0] rt_spin_lock at ffffffff815c3362
>#24 [ffff880273a03cd0] run_timer_softirq at ffffffff81069002
>#25 [ffff880273a03d70] handle_softirq at ffffffff81060d0f
>#26 [ffff880273a03db0] do_current_softirqs at ffffffff81060f3c
>#27 [ffff880273a03e20] run_ksoftirqd at ffffffff81061045
>#28 [ffff880273a03e40] smpboot_thread_fn at ffffffff81089c31
>#29 [ffff880273a03ec0] kthread at ffffffff810807fe
>#30 [ffff880273a03f50] ret_from_fork at ffffffff815cb28c
>crash> gdb list *0xffffffff815c2591
>0xffffffff815c2591 is in rt_spin_lock_slowlock (kernel/rtmutex.c:109).
>104 }
>105 #endif
>106
>107 static inline void init_lists(struct rt_mutex *lock)
>108 {
>109 if (unlikely(!lock->wait_list.node_list.prev))
>110 plist_head_init(&lock->wait_list);
>111 }
>112
>113 /*
>crash> gdb list *0xffffffff815c2590
>0xffffffff815c2590 is in rt_spin_lock_slowlock (kernel/rtmutex.c:744).
>739 struct rt_mutex_waiter waiter, *top_waiter;
>740 int ret;
>741
>742 rt_mutex_init_waiter(&waiter, true);
>743
>744 raw_spin_lock(&lock->wait_lock);
>745 init_lists(lock);
>746
>747 if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
>748 raw_spin_unlock(&lock->wait_lock);
>crash> gdb list *0xffffffff815c2cc8
>0xffffffff815c2cc8 is in rt_spin_lock_slowunlock_hirq (kernel/rtmutex.c:851).
>846 {
>847 int ret;
>848
>849 do {
>850 ret = raw_spin_trylock(&lock->wait_lock);
>851 } while (!ret);
>852
>853 __rt_spin_lock_slowunlock(lock);
>854 }
>855
>
>Dang, Santa might have delivered a lock pick set in a few more hours.

I have a small problem with understanding this…

|#24 [ffff880273a03cd0] run_timer_softirq at ffffffff81069002

Here we obtain the wait_lock of the tvec_base lock of _this_ CPU, and we get
past init_lists() before the apic timer kicks in, so we hold the wait_lock.
In the hard interrupt triggered by the apic timer we reach
get_next_timer_interrupt() and go for the same wait_lock again. Here we
use a trylock, so we avoid this deadlock.
The odd part: we get the lock. It should be the same lock, because both paths use
| struct tvec_base *base = __this_cpu_read(tvec_bases);
to get it, and we should not get it because the lock is already held.
We then get into trouble in the unlock path, where we spin forever:

|#14 [ffff880276803e50] rt_spin_unlock_after_trylock_in_irq at ffffffff815c3425
|#12 [ffff880276803e28] _raw_spin_trylock at ffffffff815c3790

which releases the lock with a trylock in order to keep lockdep happy.
My understanding was that we should be able to obtain the wait_lock here,
since we were able to obtain it in the lock path, and with interrupts off
there is nothing that could have taken the lock in the meantime.
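
To have something concrete to stare at, here is the sequence as I read it
from the trace and the 3.12-rt sources. This is a condensed sketch, not the
literal code: softirq_side()/hardirq_side() are just labels for the two
contexts, and spin_do_trylock() is the RT trylock helper as I remember it,
so take the exact names with a grain of salt:

/*
 * ksoftirqd/16, frames #22-#24: the timer softirq takes base->lock,
 * which on RT is a sleeping spinlock. The slowpath first grabs the
 * raw lock->wait_lock (rt_spin_lock_slowlock(), rtmutex.c:744).
 */
static void softirq_side(void)
{
	struct tvec_base *base = __this_cpu_read(tvec_bases);

	spin_lock_irq(&base->lock);	/* -> rt_spin_lock_slowlock()
					 *    -> raw_spin_lock(&lock->wait_lock)
					 * <== apic timer fires here, while
					 *     wait_lock is held */
}

/*
 * hard interrupt on the same CPU, frames #12-#15:
 * get_next_timer_interrupt() goes for the very same per-CPU
 * base->lock, but only with a trylock.
 */
static unsigned long hardirq_side(unsigned long now)
{
	struct tvec_base *base = __this_cpu_read(tvec_bases);
	unsigned long expires = now + 1;

	if (!spin_do_trylock(&base->lock))	/* should fail here -- but succeeds */
		return expires;

	/* ... compute expires from base->next_timer ... */

	/*
	 * rt_spin_lock_slowunlock_hirq() (rtmutex.c:851):
	 *   do { ret = raw_spin_trylock(&lock->wait_lock); } while (!ret);
	 * wait_lock is still owned by the interrupted softirq, so this
	 * loops forever until the NMI watchdog panics the box.
	 */
	rt_spin_unlock_after_trylock_in_irq(&base->lock);
	return expires;
}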

Sebastian