Re: chain_key collision with 4.7.0-rec2+

From: Peter Zijlstra
Date: Tue Jun 21 2016 - 09:40:48 EST


On Tue, Jun 21, 2016 at 07:03:58AM -0600, David Ahern wrote:
> Peter/Ingo:
>
> I have this hit many times over the past few weeks, but I do not have a
> reliable reproducer to attempt a git bisect. There 2 dumps below I hit last
> night within a few minutes of each other (reboot in between). First dump is
> for ipv4 (first login to VM after boot) and second dump is for ipv6 (second
> login to VM after boot). Both are using net-next tree, top commit is
> 697666eac66.

Plz as to not wrap dmesg output when pasting

> IPv4 one:

> [ 189.171084] ======================
> [ 189.171555] [chain_key collision ]
> [ 189.172020] 4.7.0-rc2+ #310 Tainted: G W
> [ 189.172729] ----------------------
> [ 189.173192] sshd/1463: Hash chain already cached but the contents don't > match!
> [ 189.174236] Held locks:depth: 6
> [ 189.174707] class_idx:1253 -> chain_key:00000000000004e5 > (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff8144a84c>] lock_sock+0xb/0xd
> [ 189.176376] class_idx:26 -> chain_key:00000000009ca01a > (rcu_read_lock){......}, at: [<ffffffff814421fb>] rcu_lock_acquire+0x0/0x20
> [ 189.178014] class_idx:898 -> chain_key:0000001394034382 > (rcu_read_lock_bh){......}, at: [<ffffffff814421fb>] > rcu_lock_acquire+0x0/0x20
> [ 189.179712] class_idx:898 -> chain_key:0002728068704382 > (rcu_read_lock_bh){......}, at: [<ffffffff81400393>] > rcu_lock_acquire+0x0/0x20
> [ 189.181374] class_idx:1221 -> chain_key:4e500d0e087044c5 > (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: > [<ffffffff81408a7e>] __dev_queue_xmit+0x40c/0x72d
> [ 189.183400] class_idx:1222 -> chain_key:01a1c10e0898ad0c > (_xmit_ETHER#2){+.-...}, at: [<ffffffff814287e6>] sch_direct_xmit+0x6f/0x167
> [ 189.185178] Locks in cached chain:depth: 6
> [ 189.185791] class_idx:1252 -> chain_key:00000000000004e4 > (((&ndev->rs_timer))){+.-...}
> [ 189.186957] class_idx:26 -> chain_key:00000000009c801a > (rcu_read_lock){......}
> [ 189.187991] class_idx:898 -> chain_key:0000001390034382 > (rcu_read_lock_bh){......}
> [ 189.189023] class_idx:898 -> chain_key:0002720068704382 > (rcu_read_lock_bh){......}
> [ 189.190050] class_idx:1221 -> chain_key:4e400d0e087044c5 > (dev->qdisc_running_key ?: &qdisc_running_key){+.....}
> [ 189.191370] class_idx:1220 -> chain_key:01a1c10e0898ad0c > (&(&list->lock)->rlock#3){+.-...}

> and IPv6:

> [ 22.082615] ======================
> [ 22.083126] [chain_key collision ]
> [ 22.083646] 4.7.0-rc2+ #310 Tainted: G W
> [ 22.084410] ----------------------
> [ 22.084927] sshd/1401: Hash chain already cached but the contents don't > match!
> [ 22.086010] Held locks:depth: 6
> [ 22.086509] class_idx:1255 -> chain_key:00000000000004e7 > (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff8144a84c>] lock_sock+0xb/0xd
> [ 22.088253] class_idx:26 -> chain_key:00000000009ce01a > (rcu_read_lock){......}, at: [<ffffffff814cd8a2>] rcu_read_lock+0x0/0x68
> [ 22.089999] class_idx:898 -> chain_key:000000139c034382 > (rcu_read_lock_bh){......}, at: [<ffffffff814a031b>] > rcu_lock_acquire+0x0/0x20
> [ 22.091811] class_idx:898 -> chain_key:0002738068704382 > (rcu_read_lock_bh){......}, at: [<ffffffff81400393>] > rcu_lock_acquire+0x0/0x20
> [ 22.093639] class_idx:1223 -> chain_key:4e700d0e087044c7 > (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: > [<ffffffff81408a7e>] __dev_queue_xmit+0x40c/0x72d
> [ 22.095824] class_idx:1224 -> chain_key:01a1c10e0898ed06 > (_xmit_ETHER#2){+.-...}, at: [<ffffffff814287e6>] sch_direct_xmit+0x6f/0x167
> [ 22.097616] Locks in cached chain:depth: 6
> [ 22.098197] class_idx:1248 -> chain_key:00000000000004e0 > (((&ndev->rs_timer))){+.-...}
> [ 22.099345] class_idx:26 -> chain_key:00000000009c001a > (rcu_read_lock){......}
> [ 22.100419] class_idx:898 -> chain_key:0000001380034382 > (rcu_read_lock_bh){......}
> [ 22.101535] class_idx:898 -> chain_key:0002700068704382 > (rcu_read_lock_bh){......}
> [ 22.102656] class_idx:1223 -> chain_key:4e000d0e087044c7 > (dev->qdisc_running_key ?: &qdisc_running_key){+.....}
> [ 22.104095] class_idx:1222 -> chain_key:01a1c10e0898ed06 > (&(&list->lock)->rlock#3){+.-...}


Right, so those hashes do indeed match up pretty nicely.

Now note that this isn't a new issue, just a new warning. Such
collisions were silent before and could/can/do hide actual lock cycles.

In any case, could you perhaps try with commit (from tip/locking/core)
to see if that makes it any better:

dfaaf3fa01d6 ("locking/lockdep: Use __jhash_mix() for iterate_chain_key()")

I still have to consider a better solution to the problem, because
fundamentally collisions are unavoidable, and we should deal better with
them. But our mixing function did suck somewhat, so making collisions
less likely might give us a little more time to ponder.

But if you're really annoyed by it all, disable DEBUG_LOCKDEP for a
while and rest assured that its not more broken than it was before --
just louder :-)