Re: crash in active_task, 3.0-rc6+

From: Peter Zijlstra
Date: Tue Jul 19 2011 - 07:08:55 EST


On Mon, 2011-07-11 at 11:01 -0700, Ben Greear wrote:
> This is from a lightly patched kernel (nfs related patches)
> Same system sees lockdep warnings related to scheduling that
> I've been posting as well...
>
> I haven't seen anything similar in older kernels, so this may
> be a regression. We tested .38 in this setup without seeing it,
> but haven't tested .39.
>
> [root@simech2 ~]# BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
> IP: [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
> PGD 127082067 PUD 127081067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 2
> Modules linked in: 8021q garp xt_addrtype xt_TPROXY nf_tproxy_core xt_socket nf_defrag_ipv6 xt_set ip_set nfnetlink xt_connlimit ip]
>
> Pid: 2203, comm: gnuserver Not tainted 3.0.0-rc6+ #20 Supermicro X7DBU/X7DBU
> RIP: 0010:[<ffffffff81231bf8>] [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
> RSP: 0018:ffff8801104e3628 EFLAGS: 00010096
> RAX: ffff880126b9e7e0 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff8801104e3668 R08: 0000000000000002 R09: 0000000000000001
> R10: 000002bf7b96a212 R11: ffff880107481e30 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000000002
> FS: 00007f213f07c720(0000) GS:ffff88012fc80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000004 CR3: 0000000110519000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process gnuserver (pid: 2203, threadinfo ffff8801104e2000, task ffff880126b9e7e0)
> Stack:
> 0000000000000018 0000000000000046 00000000bf3d0800 0000000000000000
> 0000000000000003 0000000000000000 0000000000000046 0000000000000002
> ffff8801104e3698 ffffffff8147e459 ffffffff810a7f25 ffffffff8103d2f7
> Call Trace:
> [<ffffffff8147e459>] _raw_spin_lock+0x3e/0x45
> [<ffffffff810a7f25>] ? __rcu_read_unlock+0xdc/0x24e
> [<ffffffff8103d2f7>] ? activate_task+0x30/0x30
> [<ffffffff810a7f25>] __rcu_read_unlock+0xdc/0x24e
> [<ffffffff8103d34d>] rcu_read_unlock+0x21/0x23
> [<ffffffff8103efd7>] select_task_rq_fair+0x8cc/0x8e1
> [<ffffffff81042b59>] ? finish_task_switch+0x78/0xf1
> [<ffffffff81042b1b>] ? finish_task_switch+0x3a/0xf1
> [<ffffffff810468d0>] ? try_to_wake_up+0x29/0x1a0
> [<ffffffff810468d0>] ? try_to_wake_up+0x29/0x1a0
> [<ffffffff81040c52>] select_task_rq+0x13/0x44
> [<ffffffff8104697f>] try_to_wake_up+0xd8/0x1a0
> [<ffffffff8103c783>] ? __wake_up+0x1d/0x48
> [<ffffffff81046a54>] default_wake_function+0xd/0xf
> [<ffffffff8106751a>] autoremove_wake_function+0x13/0x38
> [<ffffffff810395d0>] __wake_up_common+0x49/0x7f
> [<ffffffff8103c79a>] __wake_up+0x34/0x48
> [<ffffffff810a74a9>] rcu_report_exp_rnp+0x50/0x89
> [<ffffffff810a7f25>] ? __rcu_read_unlock+0xdc/0x24e
> [<ffffffff810a8032>] __rcu_read_unlock+0x1e9/0x24e
> [<ffffffff813fce40>] rcu_read_unlock+0x21/0x23
> [<ffffffff813fd52d>] ip_queue_xmit+0x35e/0x3b1
> [<ffffffff813fd1cf>] ? ip_send_reply+0x247/0x247
> [<ffffffff8140f5f3>] tcp_transmit_skb+0x785/0x7c3
> [<ffffffff81411e23>] tcp_write_xmit+0x806/0x8f5
> [<ffffffff810e646f>] ? might_fault+0x4e/0x9e
> [<ffffffff81403e25>] ? copy_from_user+0x2a/0x2c
> [<ffffffff81411f63>] __tcp_push_pending_frames+0x20/0x4d
> [<ffffffff8140411f>] tcp_push+0x84/0x86
> [<ffffffff81406577>] tcp_sendmsg+0x674/0x775
> [<ffffffff81423d68>] inet_sendmsg+0x61/0x6a
> [<ffffffff813af67a>] __sock_sendmsg_nosec+0x58/0x61
> [<ffffffff813b0db5>] __sock_sendmsg+0x3d/0x48
> [<ffffffff813b1631>] sock_sendmsg+0xa3/0xbc
> [<ffffffff810ea6ae>] ? handle_pte_fault+0x7fc/0x84d
> [<ffffffff81110b14>] ? mem_cgroup_get_limit+0x45/0x45
> [<ffffffff8111a4e9>] ? fget_light+0x35/0xac
> [<ffffffff813b16b2>] ? sockfd_lookup_light+0x1b/0x53
> [<ffffffff813b1bf5>] sys_sendto+0xfa/0x11f
> [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
> [<ffffffff8109f0a1>] ? audit_syscall_entry+0x119/0x145
> [<ffffffff81484d52>] system_call_fastpath+0x16/0x1b


Fun, that's a recursion we've been poking at for the past few days, but
I haven't seen it do a NULL-ptr deref like this yet.

I'm also not quite sure the below patches will catch this particular
recursion, it might be you need the patch from
lkml.kernel.org/r/20110718151524.GA4236@xxxxxxxxxxxxxxxxxx as well,
since this is RCU recursing on itself without the aid of interrupts.

How reproducible is this? If you can trigger it reliably, what's your
.config and recipe for making it go bang?

Subject: rcu, softirq: Avoid rcu_read_unlock_special vs softirq recursion
From: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Fri, 15 Jul 2011 17:04:27 +0200

$task				IRQ		SoftIRQ

rcu_read_lock()

 /* do stuff */

 <preempt> |= UNLOCK_BLOCKED

rcu_read_unlock()
  --t->rcu_read_lock_nesting

				irq_enter();
				/* do stuff, don't use RCU */
				irq_exit();
				  sub_preempt_count(IRQ_EXIT_OFFSET);
				  invoke_softirq()

						ttwu();
						  spin_lock_irq(&pi->lock)
						  rcu_read_lock();
						  /* do stuff */
						  rcu_read_unlock();
						    rcu_read_unlock_special()
						      rcu_report_exp_rnp()
						        ttwu()
						          spin_lock_irq(&pi->lock) /* deadlock */

  rcu_read_unlock_special(t);

Ed can trigger this easily because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq work
first, but even without that the above can happen.

Cure this by also excluding softirqs from the
rcu_read_unlock_special() handler and ensuring the force_irqthreads
ksoftirqd/# wakeup is done from full softirq context.

[ Alternatively, delaying the ->rcu_read_lock_nesting decrement
until after the special handling would make the thing more robust
in the face of interrupts as well. ]

Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Reported-and-tested-by: Ed Tomlinson <edt@xxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Link: http://lkml.kernel.org/r/1310742267.2586.353.camel@twins
---
 kernel/rcutree_plugin.h |    2 +-
 kernel/softirq.c        |   12 ++++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 14dc7dd..373c9c8 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -309,7 +309,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	}
 
 	/* Hardware IRQ handlers cannot block. */
-	if (in_irq()) {
+	if (in_irq() || in_serving_softirq()) {
 		local_irq_restore(flags);
 		return;
 	}
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 40cf63d..fca82c3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -315,16 +315,24 @@ static inline void invoke_softirq(void)
 {
 	if (!force_irqthreads)
 		__do_softirq();
-	else
+	else {
+		__local_bh_disable((unsigned long)__builtin_return_address(0),
+				SOFTIRQ_OFFSET);
 		wakeup_softirqd();
+		__local_bh_enable(SOFTIRQ_OFFSET);
+	}
 }
 #else
 static inline void invoke_softirq(void)
 {
 	if (!force_irqthreads)
 		do_softirq();
-	else
+	else {
+		__local_bh_disable((unsigned long)__builtin_return_address(0),
+				SOFTIRQ_OFFSET);
 		wakeup_softirqd();
+		__local_bh_enable(SOFTIRQ_OFFSET);
+	}
 }
 #endif


Subject: sched: Add irq_{enter,exit}() to scheduler_ipi()
From: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Mon, 18 Jul 2011 11:29:44 +0200

Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.

Now that we do do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things.

This affects things like timekeeping, NO_HZ and RCU, basically
everything with a hook in irq_enter/exit.

Explicit examples of things going wrong are:

sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
a new reading from GTOD and TSC. Without this
callback, time is stuck in the past.

RCU -- needs in_irq() to work in order to avoid some nasty deadlocks

Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Link: http://lkml.kernel.org/r/1310981384.13765.40.camel@twins
---
 kernel/sched.c |   40 ++++++++++++++++++++++++++++++++++------
 1 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..eb9cbe7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2544,13 +2544,9 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+static void sched_ttwu_do_pending(struct task_struct *list)
 {
 	struct rq *rq = this_rq();
-	struct task_struct *list = xchg(&rq->wake_list, NULL);
-
-	if (!list)
-		return;
 
 	raw_spin_lock(&rq->lock);
 
@@ -2563,9 +2559,41 @@ static void sched_ttwu_pending(void)
 	raw_spin_unlock(&rq->lock);
 }
 
+static void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	sched_ttwu_do_pending(list);
+}
+
 void scheduler_ipi(void)
 {
-	sched_ttwu_pending();
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	/*
+	 * Not all reschedule IPI handlers call irq_enter/irq_exit, since
+	 * traditionally all their work was done from the interrupt return
+	 * path. Now that we actually do some work, we need to make sure
+	 * we do call them.
+	 *
+	 * Some archs already do call them, luckily irq_enter/exit nest
+	 * properly.
+	 *
+	 * Arguably we should visit all archs and update all handlers,
+	 * however a fair share of IPIs are still resched only so this would
+	 * somewhat pessimize the simple resched case.
+	 */
+	irq_enter();
+	sched_ttwu_do_pending(list);
+	irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)