RE: [PATCH] usb: uas: fix usb subsystem hang after power off hub port

From: Kento.A.Kobayashi
Date: Wed Apr 03 2019 - 23:57:40 EST


Hi,

>> Root Cause
>> - Block layer timeout happens after power off UAS USB device which is accessed as reproduce step. During timeout error handler process, scsi host state becomes SHOST_CANCEL_RECOVERY that causes IO hangs up and lock cannot be released. And in final, usb subsystem hangs up.
>> Follow is function call:
>> blk_mq_timeout_work
>> â->scsi_times_out (â means some functions are not listed before this function.)
>> â-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY)
>> â -> scsi_error_handler
>> â-> uas_eh_device_reset_handler
>> -> usb_lock_device_for_reset <- take lock
>> -> usb_reset_device
>> â-> rebind = uas_post_reset (return 1 since ENODEV)
>> â-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
>> â-> uas_disconnect (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
>> â -> scsi_queue_rq
>
>How does scsi_queue_rq get called here? As far as I can see, this shouldn't happen.

We confirmed the function call path on linux 4.9 when this problem occured since we are working on it. In linux 4.9, the last function is scsi_request_fn instead of scsi_queue_rq. In staging.git, we think the scsi_queue_rq is called by follow path.
uas_disconnect
|- scsi_remove_host
|- scsi_forget_host
|- __scsi_remove_device
|- device_del
|- bus_remove_device
|- device_release_driver
|- device_release_driver_internal
|- __device_release_driver
|- drv->remove(dev) (sd_remove)
|- sd_shutdown
|- sd_sync_cache
|- scsi_execute
|- __scsi_execute
|- blk_execute_rq
|- blk_execute_rq_nowait
|- blk_mq_sched_insert_request
|- blk_mq_run_hw_queue
|- __blk_mq_delay_run_hw_queue
|- __blk_mq_run_hw_queue
|- blk_mq_sched_dispatch_requests
|- blk_mq_dispatch_rq_list
|- q->mq_ops->queue_rq (scsi_queue_rq)

>> Countermeasure
>> - Make uas_post_reset doesnât return 1 when ENODEV returns from uas_configure_endpoints since usb_unbind_and_rebind_marded_interfaces doesnât need to do unbind/rebind operations in this situation.
>> blk_mq_timeout_work
>> â->scsi_times_out (â means some functions are not listed before this function.)
>> â-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY)
>> â -> scsi_error_handler
>> â-> uas_eh_device_reset_handler (*1)
>> -> usb_lock_device_for_reset <- take lock
>> -> usb_reset_device
>> -> usb_reset_and_verify_device (return ENODEV and FAILED will be reported to *1)
>> -> uas_post_reset returns 0 when ENODEV => rebind=0
>> -> usb_unbind_and_rebind_marked_interfaces (rebind=0)
>
>The difference is that uas_disconnect wasn't called here. But that routine should not cause any problems -- you're always supposed to be able to unbind a driver from a device. So it looks like this is not the right way to solve the problem.

We confirmed usb_driver_release_interface will call usb_unbind_interface when this problem occurs.
So usb_unbind_interface will call driver disconnect callbak.

Regards,
Kento Kobayashi