Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"

From: Johannes Berg
Date: Tue Oct 23 2018 - 16:25:00 EST


On Tue, 2018-10-23 at 21:50 +0200, Johannes Berg wrote:
>
> The reason for the report is that with the workqueue annotations, we've
> added new links to the chains that lockdep sees. I _think_ those
> annotations are correct and only create links in the chain when they are
> actually present, but since those links are between *classes* not
> *instances* these new links may cause false positives.

So over in the nvmet-rdma thread you asked:

> Are the lockdep annotations in kernel/workqueue.c correct? My understanding
> is that lock_map_acquire() should only be used to annotate mutually exclusive
> locking (e.g. mutex, spinlock). However, multiple work items associated with
> the same workqueue can be executed concurrently. From lockdep.h:
>
> #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)

Yes, I will admit that I'm not entirely sure that they are indeed
correct, because there are complexities with single-threaded/ordered and
multi-threaded workqueues.

The annotations are like this though:

1) when the work runs, we have

lock_map_acquire(&pwq->wq->lockdep_map);
lock_map_acquire(&work->lockdep_map);

call_work_function()

lock_map_release(&work->lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);

(where &work->lockdep_map is &lockdep_map in the actual code, but that's
a copy of it and we just use it because the work might free itself, it's
not relevant for lockdep's purposes)

This should be OK, since if you're running the work struct on a given
work queue you've undoubtedly caused a dependency chain of
workqueue -> work item -> anything in the worker function.

Note that this is actually limited, see the comments around this code.


2) flush_workqueue does:

lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);

which causes a chain
whatever the caller holds -> workqueue.

This should similarly be OK. Note that this already lets you find bugs,
because if you flush a workqueue that might run a work that takes a
lock, while holding the same lock, you deadlock:

CPU0 CPU1
lock(A)
run work X on W:
lock(A) // waiting to acquire
flush_workqueue(W)

-> deadlock, but flagged by lockdep even if typically work X finishes
well before CPU0 gets to lock(A).


3) start_flush_work/__flush_work

These are a bit more tricky, and actually fairly limited.

The &work->lockdep_map ones here are actually straight-forward - if you
try to wait for a work item to finish while holding a lock that this
work item might also take, you need this dependency to flag it in
lockdep.

The &pwq->wq->lockdep_map are a bit trickier. Basically, they let you
catch things like

workqueue W
work items S1, S2
lock A

W is single-threaded/ordered and can execute both S1 and S2.

executing S1 creates: W -> S1 -> A // let's say S1 locks A
executing S2 creates: W -> S2

Now if you do

lock(A);
flush_work(S2);

that seems fine, on the face of it, since S2 doesn't lock A. However,
there's a problem now. If S1 was scheduled to run before S2 but hasn't
completed yet, it might be waiting for lock(A) inside S1, and S2 is
waiting for S1 because the workqueue is single-threaded.

The way this is expressed in lockdep is that the flush side will
additionally create a dependency chain of

A -> W

with lock_map_acquire in start_flush_work(), given the right
circumstances. Together with the chains created above, you'd get a
report of

A -> W -> S1 -> A

which can cause a deadlock, even though you were trying to flush S2 and
that doesn't even show up here! This is one of the trickier reports to
actually understand, but we've had quite a few of these in the past.

I think all the conditions here are fine, and Tejun did look at them I
believe.

Now, I said this was limited - the problem here is that
start_flush_work() will only ever find a pwq in a limited amount of
scenarios, which basically means that we're creating fewer chain links
than we should. We just don't know that workqueue this item might
execute on unless it's currently pending, so if it typically completes
by the time you get to flush_work(S2), this code can no longer create
the "A -> W" dependency and will miss issues in a significant number of
cases, since that's really the typical case of flushing a work (that
it's not currently pending or running). With the "A -> W" dependency
missing we're left with just the chains

W -> S1 -> A
W -> S2
A -> S2

and don't see any problem even though it clearly exists. The original
code before Tejun's workqueue refactoring was actually better in this
regard, I believe.

Come to think of it, perhaps we could address it by turning around the
dependencies when we run the work? then we'd end up with

S1 -> W -> A
S2 -> W
A -> S2

and actually _have_ a link from A back to itself in this case. But I'd
have to analyse the other cases in more detail to see whether that makes
sense. However, I don't think we depend on the S1 <-> W links much right
now, so it's tempting to do this, let's say next year after the
currently flagged issues are more under control :-)


As for your question regarding exclusive, I must admit that I don't
recall exactly why that was chosen. I want to believe I discussed that
with PeterZ back when I originally added the workqueue lockdep
annotations (before they were mostly lost), but I couldn't point you to
any discussions (which probably were on IRC anyway), and I can't say I
remember the exact details of which type of lockdep acquire to choose at
what time. I do seem to recall that it's not so much about exclusiveness
but about the type of tracking that happens internally regarding things
like lock recursion.

johannes