Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock

From: Oleg Nesterov
Date: Tue Mar 20 2007 - 12:04:43 EST


On 03/20, Jarek Poplawski wrote:
>
> >> On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> >>
> >>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> >>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@xxxxxxxxxxxxxx> wrote:
> >>>>> ...
> >>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> >>>>> [ 1756.728271] last function: laundromat_main+0x0/0x69 [nfsd]
> >>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
> >>>>> [ 1756.728435] #0: (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> >>>>> [ 1756.728679] #1: (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> >>>>> [ 1756.728923] [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> >>>>> [ 1756.729015] [<c1003d7f>] show_trace+0x12/0x14
> >>>>> [ 1756.729103] [<c1003e79>] dump_stack+0x16/0x18
> >>>>> [ 1756.729187] [<c102c2e8>] run_workqueue+0x167/0x170
> >>>>> [ 1756.729276] [<c102c437>] worker_thread+0x146/0x165
> >>>>> [ 1756.729368] [<c102f797>] kthread+0x97/0xc4
> >>>>> [ 1756.729456] [<c1003bdb>] kernel_thread_helper+0x7/0x10
> >>>>> [ 1756.729547] =======================
> ...
> >>> I think I'm responsible for this message (commit
> >>> d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
> >>> function executed by the workqueue (here laundromat_main) leaked an
> >>> atomic context or is still holding locks (2 locks in this case).
>
> This check is valid with keventd, but it looks like nfsd runs
> kthread by itself. I'm not sure it's illegal to hold locks then,

nfsd creates laundry_wq by itself, yes, but cwq->thread always enters
work->func() with lockdep_depth() == 0. So unless lockdep_depth() itself
is buggy, lockdep_depth() != 0 afterwards means that work->func() returned
with a lock still held (or that it flushed its own workqueue while holding
a lock, but in that case we would see a different trace).
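
For illustration, a minimal and purely hypothetical work function that
would trigger exactly this report - it returns with a mutex held, so
lockdep_depth(current) is still non-zero when run_workqueue() regains
control (the names are made up, this is not from the nfsd code):

	static DEFINE_MUTEX(example_mutex);

	static void leaky_work_fn(struct work_struct *work)
	{
		mutex_lock(&example_mutex);
		/* ... do the real work ... */

		/* missing mutex_unlock(&example_mutex): we return with
		 * the lock held, lockdep_depth(current) > 0, and
		 * run_workqueue() prints "BUG: workqueue leaked lock or
		 * atomic" with this function as the "last function". */
	}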

Personally I agree with Andrew:
>
> > OK. That's not necessarily a bug: one could envisage a (weird) piece of
> > code which takes a lock then releases it on a later workqueue invocation.

> @@ -323,13 +324,15 @@ static void run_workqueue(struct cpu_wor
>  		BUG_ON(get_wq_data(work) != cwq);
>  		if (!test_bit(WORK_STRUCT_NOAUTOREL, work_data_bits(work)))
>  			work_release(work);
> +		ld = lockdep_depth(current);
> +
>  		f(work);
>
> -		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> +		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {

and with this change we will also get a BUG report for the "takes a lock,
then releases it on a later workqueue invocation" case.
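
To make that concrete, here is a contrived, purely hypothetical work
function of the kind Andrew describes (assuming both invocations run on
the same singlethreaded worker, so the same task owns the mutex):

	static DEFINE_MUTEX(split_mutex);
	static int have_lock;

	static void split_work_fn(struct work_struct *work)
	{
		if (!have_lock) {
			mutex_lock(&split_mutex);	/* depth 0 -> 1 */
			have_lock = 1;
		} else {
			mutex_unlock(&split_mutex);	/* depth 1 -> 0 */
			have_lock = 0;
		}
	}

The old check only fires on the first invocation (depth > 0 after
f(work)); with the ld-delta check above, the second invocation is
reported as well, because the depth changes across f(work) both times.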

Btw, __nfs4_state_shutdown() does cancel_delayed_work(&laundromat_work).
This is unneeded and, imho, confusing: laundry_wq has already been
destroyed at that point, so it would be a BUG to still have a pending
delayed work.
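
The usual teardown ordering - just a sketch of the general rule, not the
actual nfsd code - would be:

	cancel_delayed_work(&laundromat_work);	/* drop a pending timer, if any */
	flush_workqueue(laundry_wq);		/* wait for a running instance */
	destroy_workqueue(laundry_wq);		/* only now tear the wq down */

i.e. cancelling after destroy_workqueue() can never find anything to do.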

Oleg.
