Re: (4.3.0) r8152: deadlock related to runtime suspend?

From: Peter Wu
Date: Mon Dec 07 2015 - 04:38:23 EST


On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
> Hi Peter,
>
> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm is disabled?
>
> I always disable usb auto-pm in below way.
>
> # echo on | tee /sys/bus/usb/devices/*/power/control
> # echo on > /sys/bus/pci/devices/<bus_name>/power/control
>
> Thanks,
> Baolu

Hi Baolu,

The deadlock does not seem to occur with auto-PM disabled, but that is a
workaround for the issue. The hang can always be reproduced under this
test:

- Start a QEMU VM, passing through the USB adapter
- This VM boots to a busybox shell with no other services running or
udev magic (to reduce interference).
- Enable runtime PM for all devices by default (see script below)
- From the console, invoke "ip link set eth1 up" (eth0 is a virtio
adapter).

# somewhere in /init after mounting filesystems
echo /sbin/hotplug > /proc/sys/kernel/hotplug
echo auto | tee /sys/bus/pci/devices/*/power/control \
/sys/bus/usb/devices/*/power/control >/dev/null

#!/bin/sh
# /sbin/hotplug
path="/sys/$DEVPATH/power/control"
[ -e "$path" ] || return
newval=auto
read status < "$path"
if [ "x$status" != "x$newval" ]; then
echo "$DEVPATH: $status -> $newval" >/dev/kmsg
echo $newval > "$path"
fi

With "auto", the ip command hangs (a trace can be found on the bottom of
this mail). With "on", it does not.

If I keep a loop spinning that invokes `ethtool eth1`, the command
returns immediately without issues (presumably because the device is not
suspended through runtime PM).

Under some circumstances I get a lockdep warning (when trying to bring
an interface down if I remember correctly). Its trace can be found on
the bottom of this mail.

I'll keep testing. For the lockdep warning, my initial guess is that
calling schedule_delayed_work_sync under tp->lock is a bad idea because
scheduled work can execute and try to claim tp->lock too.

Maybe there are two different lockup cases here, I'll keep testing.

Kind regards,
Peter

> On 12/05/2015 06:59 PM, Peter Wu wrote:
> > Hi,
> >
> > I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> > 0bda:8153), but when I did last night, it resulted in a lockup of
> > processes doing networking ("ip link", "ping", "ethtool", ...).
> >
> > A (few) minute(s) before that event, I noticed that there was no network
> > connectivity (ping hung) which was somehow solved by invoking "ethtool
> > eth1" (triggering runtime pm wakeup?). This same trick did not work at
> > the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> > and interrupt (^C) did not work at all.
> >
> > Since that did not work, I pulled the USB adapter and re-inserted it,
> > hoping it would reset things. That did not work at all, there was a
> > "usb disconnect" message, but no further driver messages.
> >
> > Fast forward an hour, and it has become a disaster. I have terminated
> > and killed many programs via SysRq but am still unable to get a stable
> > system that does not hang on network I/O. Even the suspend process
> > fails so in the end I attempted to shutdown the system. After half an
> > hour after getting the poweroff message, I issued SysRq + B to reboot
> > (since SysRq + O did not shut down either).
> >
> > Attached are logs with various various backtraces from SysRq and failed
> > suspend. Let me know if you need more information!
> >
> > By the way, often I have to rmmod xhci and re-insert it, otherwise
> > plugging it in does not result in a detection. A USB 2.0 port does not
> > have this problem (runtime PM is enabled for all devices). This is the
> > USB 3.0 port:
> >
> > 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
> > Host Controller [1033:0194] (rev 03)
>

--

lockdep splat from the bare machine:

======================================================
[ INFO: possible circular locking dependency detected ]
4.3.0-custom #1 Tainted: G O
-------------------------------------------------------
kworker/0:1/38 is trying to acquire lock:
(&tp->control){+.+.+.}, at: [<ffffffffa05d3544>] rtl8152_resume+0x24/0x130 [r8152]

but task is already holding lock:
((&(&tp->schedule)->work)){+.+.+.}, at: [<ffffffff8107819c>] process_one_work+0x15c/0x660

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 ((&(&tp->schedule)->work)){+.+.+.}:

[<ffffffff810b17d3>] lock_acquire+0xc3/0x1d0
[<ffffffff8107751d>] flush_work+0x3d/0x290
[<ffffffff810791ae>] __cancel_work_timer+0xfe/0x1c0
[<ffffffff810792a3>] cancel_delayed_work_sync+0x13/0x20
[<ffffffffa05d0dea>] rtl8152_set_speed+0x2a/0x260 [r8152]
[<ffffffffa05d4a26>] rtl8152_open+0x396/0x4f0 [r8152]
[<ffffffff81567ebf>] __dev_open+0xaf/0x120
[<ffffffff815681ad>] __dev_change_flags+0x9d/0x160
[<ffffffff81568299>] dev_change_flags+0x29/0x70
[<ffffffff8157785a>] do_setlink+0x5ba/0xb00
[<ffffffff81578469>] rtnl_newlink+0x5a9/0x8a0
[<ffffffff81576bb4>] rtnetlink_rcv_msg+0x84/0x210
[<ffffffff81599ab7>] netlink_rcv_skb+0x97/0xb0
[<ffffffff81576b1a>] rtnetlink_rcv+0x2a/0x40
[<ffffffff8159943e>] netlink_unicast+0x1be/0x2d0
[<ffffffff8159983b>] netlink_sendmsg+0x2eb/0x3a0
[<ffffffff81545428>] sock_sendmsg+0x38/0x50
[<ffffffff81545d42>] ___sys_sendmsg+0x292/0x2a0
[<ffffffff81546cc2>] __sys_sendmsg+0x42/0x80
[<ffffffff81546d12>] SyS_sendmsg+0x12/0x20
[<ffffffff816707d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

-> #0 (&tp->control){+.+.+.}:
[<ffffffff810b0ed7>] __lock_acquire+0x1a47/0x1a50
[<ffffffff810b17d3>] lock_acquire+0xc3/0x1d0
[<ffffffff8166c0a7>] mutex_lock_nested+0x67/0x3c0
[<ffffffffa05d3544>] rtl8152_resume+0x24/0x130 [r8152]
[<ffffffff814d735f>] usb_resume_interface.isra.6+0x9f/0x100
[<ffffffff814d75fa>] usb_resume_both+0x6a/0x130
[<ffffffff814d869a>] usb_runtime_resume+0x1a/0x20
[<ffffffff814602c2>] __rpm_callback+0x32/0x70
[<ffffffff8146035f>] rpm_callback+0x5f/0x80
[<ffffffff81460c4b>] rpm_resume+0x4db/0x7d0
[<ffffffff81460b0c>] rpm_resume+0x39c/0x7d0
[<ffffffff8146211e>] __pm_runtime_resume+0x4e/0x80
[<ffffffff814d7112>] usb_autopm_get_interface+0x22/0x60
[<ffffffffa05d4cf0>] rtl_work_func_t+0x40/0x3ee [r8152]
[<ffffffff8107822d>] process_one_work+0x1ed/0x660
[<ffffffff810786ee>] worker_thread+0x4e/0x450
[<ffffffff8107f13f>] kthread+0xef/0x110
[<ffffffff81670b4f>] ret_from_fork+0x3f/0x70

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock((&(&tp->schedule)->work));
lock(&tp->control);
lock((&(&tp->schedule)->work));
lock(&tp->control);

*** DEADLOCK ***

2 locks held by kworker/0:1/38:
#0: ("events"){.+.+.+}, at: [<ffffffff8107819c>] process_one_work+0x15c/0x660
#1: ((&(&tp->schedule)->work)){+.+.+.}, at: [<ffffffff8107819c>] process_one_work+0x15c/0x660

stack backtrace:
CPU: 0 PID: 38 Comm: kworker/0:1 Tainted: G O 4.3.0-custom #1
Hardware name: CLEVO CO. B7130 /B7130 , BIOS 6.00 08/27/2010
Workqueue: events rtl_work_func_t [r8152]
ffffffff8264d7d0 ffff8802321ff950 ffffffff8134cdac ffffffff8264d7d0
ffff8802321ff990 ffffffff810adbb3 ffff8802321ff9e0 ffff8802320e50b8
0000000000000002 0000000000000001 ffff8802320e50e0 ffff8802320e4880
Call Trace:
[<ffffffff8134cdac>] dump_stack+0x4e/0x82
[<ffffffff810adbb3>] print_circular_bug+0x1e3/0x250
[<ffffffff810b0ed7>] __lock_acquire+0x1a47/0x1a50
[<ffffffff810c505d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[<ffffffff810b17d3>] lock_acquire+0xc3/0x1d0
[<ffffffffa05d3544>] ? rtl8152_resume+0x24/0x130 [r8152]
[<ffffffffa05d3544>] ? rtl8152_resume+0x24/0x130 [r8152]
[<ffffffff8166c0a7>] mutex_lock_nested+0x67/0x3c0
[<ffffffffa05d3544>] ? rtl8152_resume+0x24/0x130 [r8152]
[<ffffffff8166d7be>] ? mutex_unlock+0xe/0x10
[<ffffffff814d8680>] ? usb_runtime_suspend+0x70/0x70
[<ffffffffa05d3544>] rtl8152_resume+0x24/0x130 [r8152]
[<ffffffff814d735f>] usb_resume_interface.isra.6+0x9f/0x100
[<ffffffff814d75fa>] usb_resume_both+0x6a/0x130
[<ffffffff814d869a>] usb_runtime_resume+0x1a/0x20
[<ffffffff814602c2>] __rpm_callback+0x32/0x70
[<ffffffff8146035f>] rpm_callback+0x5f/0x80
[<ffffffff814d8680>] ? usb_runtime_suspend+0x70/0x70
[<ffffffff81460c4b>] rpm_resume+0x4db/0x7d0
[<ffffffff81460b0c>] rpm_resume+0x39c/0x7d0
[<ffffffff8146211e>] __pm_runtime_resume+0x4e/0x80
[<ffffffff814d7112>] usb_autopm_get_interface+0x22/0x60
[<ffffffffa05d4cf0>] rtl_work_func_t+0x40/0x3ee [r8152]
[<ffffffff8107822d>] process_one_work+0x1ed/0x660
[<ffffffff8107819c>] ? process_one_work+0x15c/0x660
[<ffffffff810786ee>] worker_thread+0x4e/0x450
[<ffffffff810786a0>] ? process_one_work+0x660/0x660
[<ffffffff8107f13f>] kthread+0xef/0x110
[<ffffffff8107f050>] ? kthread_create_on_node+0x200/0x200
[<ffffffff81670b4f>] ret_from_fork+0x3f/0x70
[<ffffffff8107f050>] ? kthread_create_on_node+0x200/0x200

Hung task message in the VM after "ip link set eth1 up"

INFO: task kworker/1:1:27 blocked for more than 60 seconds.
Not tainted 4.4.0-rc3-rtlwifi+ #9
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/1:1 D ffff88000de8bca0 0 27 2 0x00000000
Workqueue: events linkwatch_event
ffff88000de8bca0 ffff88000de78808 000000000de8bc98 ffff88000dcd0000
ffff88000de78000 ffff88000de8c000 0000000000000246 ffffffff81ceb428
ffff88000de78000 00000000ffffffff ffff88000de8bcb8 ffffffff8173a23c
Call Trace:
[<ffffffff8173a23c>] schedule+0x3c/0x90
[<ffffffff8173a2a5>] schedule_preempt_disabled+0x15/0x20
[<ffffffff8173b9dc>] mutex_lock_nested+0x16c/0x390
[<ffffffff815c9cd7>] ? rtnl_lock+0x17/0x20
[<ffffffff815c9cd7>] ? rtnl_lock+0x17/0x20
[<ffffffff815c9cd7>] rtnl_lock+0x17/0x20
[<ffffffff815cd80e>] linkwatch_event+0xe/0x30
[<ffffffff810c22f7>] process_one_work+0x1f7/0x640
[<ffffffff810c226c>] ? process_one_work+0x16c/0x640
[<ffffffff810c278e>] worker_thread+0x4e/0x450
[<ffffffff810c2740>] ? process_one_work+0x640/0x640
[<ffffffff810c2740>] ? process_one_work+0x640/0x640
[<ffffffff810c97c2>] kthread+0xf2/0x110
[<ffffffff810c96d0>] ? __kthread_parkme+0x90/0x90
[<ffffffff81740bbf>] ret_from_fork+0x3f/0x70
[<ffffffff810c96d0>] ? __kthread_parkme+0x90/0x90
3 locks held by kworker/1:1/27:
#0: ("events"){.+.+.+}, at: [<ffffffff810c226c>] process_one_work+0x16c/0x640
#1: ((linkwatch_work).work){+.+...}, at: [<ffffffff810c226c>] process_one_work+0x16c/0x640
#2: (rtnl_mutex){+.+.+.}, at: [<ffffffff815c9cd7>] rtnl_lock+0x17/0x20
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/