workqueue code needing preemption disabled

From: Steven Rostedt
Date: Mon Mar 18 2013 - 10:36:34 EST


Hi Tejun,

I'm debugging a crash on -rt that has the following:

kernel BUG at kernel/sched/core.c:1731!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU 5
Pid: 16637, comm: kworker/5:0 Not tainted 3.6.11-rt30.25.el6rt.x86_64 #1 HP ProLiant DL580 G7
RIP: 0010:[<ffffffff8151ebea>] [<ffffffff8151ebea>] __schedule+0x89a/0x8c0
RSP: 0018:ffff880fec355c30 EFLAGS: 00010006
RAX: ffff880fff951900 RBX: ffff880fff951900 RCX: ffffffffff48fb8a
RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff880fec355cc0 R08: 0000000000000001 R09: 0000000000000004
R10: 0000000000000004 R11: 0000000000000002 R12: 0000000000000005
R13: ffff880f61b417a0 R14: ffff883fff051900 R15: ffff880fec355d00
FS: 0000000000000000(0000) GS:ffff880fff940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003e0e98bd30 CR3: 0000000fe0348000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/5:0 (pid: 16637, threadinfo ffff880fec354000, task ffff880fe46d8000)
Stack:
ffff880fea074d80 ffff880fec354010 ffff880fec354000 ffff880fec354010
ffff880fec354000 ffff880fec354010 ffff880fec354000 ffff880fec354010
ffff880fec354000 ffff880fec355fd8 0000000000000286 ffff880fe46d8000
Call Trace:
[<ffffffff8151ed69>] schedule+0x29/0x70
[<ffffffff8151f8ed>] rt_spin_lock_slowlock+0x10d/0x310
[<ffffffff81240500>] ? ioc_destroy_icq+0xe0/0xe0
[<ffffffff81240500>] ? ioc_destroy_icq+0xe0/0xe0
[<ffffffff815200e6>] rt_spin_lock+0x26/0x30
[<ffffffff8106418b>] process_one_work+0x1ab/0x560
[<ffffffff81065f3b>] worker_thread+0x16b/0x510
[<ffffffff8151e76b>] ? __schedule+0x41b/0x8c0
[<ffffffff81065dd0>] ? manage_workers+0x340/0x340
[<ffffffff8106b246>] kthread+0x96/0xa0
[<ffffffff81528664>] kernel_thread_helper+0x4/0x10
[<ffffffff8106b1b0>] ? kthreadd+0x1e0/0x1e0
[<ffffffff81528660>] ? gs_change+0xb/0xb
Code: c4 01 00 00 00 00 00 40 e9 86 f8 ff ff 83 be 90 02 00 00 00 0f 85
20 f8 ff ff 48 89 f7 e8 df a0 b5 ff e9 13 f8 ff ff 0f 0b eb fe <0f> 0b
0f 1f 40 00 eb fa e8 d9 00 00 00 e9 07 fe ff ff 0f 0b 66

The bug occurred on this line:

static void try_to_wake_up_local(struct task_struct *p)
{
	struct rq *rq = task_rq(p);

	BUG_ON(rq != this_rq());	<---- bug here
	BUG_ON(p == current);
	lockdep_assert_held(&rq->lock);

	if (!raw_spin_trylock(&p->pi_lock)) {
		raw_spin_unlock(&rq->lock);
		raw_spin_lock(&p->pi_lock);
		raw_spin_lock(&rq->lock);
	}


Now in your code you have the comment:

* X: During normal operation, modification requires gcwq->lock and
* should be done only from local cpu. Either disabling preemption
* on local cpu or grabbing gcwq->lock is enough for read access.
* If GCWQ_DISASSOCIATED is set, it's identical to L.

struct worker has flags marked with X.
struct worker_pool has flags and idle_list marked with X.

In -rt, spin_locks do not disable preemption or irqs, but they do
disable migration. If any code depends on spin_lock disabling
preemption, we need to either change that code so it no longer requires
it, or explicitly disable preemption in the critical paths. Note that if
we explicitly disable preemption, we cannot take spin_locks inside those
sections, because in -rt a spin_lock can block and schedule.
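
The constraint above can be modeled in a few lines of user-space C. This is only a sketch of the rules as described in this mail, not kernel code: every model_* name is hypothetical, the counters stand in for the real per-task state, and blocking is represented by a refused lock attempt rather than an actual schedule.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model only. Every name here is hypothetical; none of this
 * is the kernel's implementation.
 */
static int model_preempt_count;	/* > 0: preemption is disabled */
static int model_migrate_count;	/* > 0: task is pinned to its CPU */

static void model_preempt_disable(void)
{
	model_preempt_count++;
}

static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* In this model a task may block only while preemption is enabled. */
static bool model_can_block(void)
{
	return model_preempt_count == 0;
}

/* Mainline-like behavior: holding the lock implies preemption is off. */
static void model_mainline_spin_lock(void)
{
	model_preempt_disable();
}

static void model_mainline_spin_unlock(void)
{
	model_preempt_enable();
}

/*
 * -rt-like behavior: the lock may block, so the attempt is refused when
 * preemption is disabled. On success only migration is disabled.
 */
static bool model_rt_spin_lock(void)
{
	if (!model_can_block())
		return false;	/* would schedule with preemption disabled */
	model_migrate_count++;
	return true;
}

static void model_rt_spin_unlock(void)
{
	assert(model_migrate_count > 0);
	model_migrate_count--;
}
```

In this model, code that holds the mainline-like lock can rely on model_can_block() being false, while code that holds the -rt-like lock cannot. Wrapping a section in model_preempt_disable() restores that guarantee, but model_rt_spin_lock() then fails inside the section, which is the trade-off described above.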

I've tried to work through the code, but I'm not yet familiar enough
with it to know where the issues are. I was hoping you could point me at
the areas that would cause problems when spin_locks do not disable
preemption.

Thanks!

-- Steve

