Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

From: Wenchao Hao
Date: Wed Apr 06 2022 - 09:42:11 EST


On 2022/4/4 13:28, Hannes Reinecke wrote:
> On 4/3/22 19:17, Mike Christie wrote:
>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>> middle of a run and IO not yet timed out. For example, the code to
>>> block a queue and reset the device could be used for the new EH and
>>> SG_SCSI_RESET_DEVICE handling.
>>>
>>
>> Hannes or others,
>>
>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
>> supported and more only used for controlled testing?
>
> That's actually a problem in scsi_ioctl_reset(); it really should wait for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls into the various reset functions.
>
> But really, I'd rather get my EH rework in before we're start discussing modifying EH behaviour.
> Let me repost it ...
>
> Cheers,
>
> Hannes

Hi Hannes,

According to our statistics, an abort failure caused by any of the following scenarios can be handled by a LUN reset:
1. Task execution in the disk's FW is abnormal;
2. Intermittent bit errors or intermittent disconnections;
3. The FW does not respond to I/O.

The following scenarios cannot be handled by a LUN reset:
1. A disk HW issue;
2. DDR UNC in the disk, which cannot be fixed; the only way out is to power the disk off and then on again;
3. The disk's FW is out of service, which cannot be fixed; the only way out is to power the disk off and then on again.

The statistics also show that most command abort failures can be handled by a LUN reset.

So we plan to design a lightweight timeout handling flow as follows:

                  if lightweight EH is disabled (default)
scsi_times_out ====================================> original EH flow
     ||
     || if lightweight EH is enabled
     ||
     \/
do not use the current timeout flow; branch to another flow that performs the following steps:

abort command
     ||
     || failed
     ||
     \/
stop the single LUN's I/O (need to wait until the LUN's failed command count equals its busy command count)
     ||
     || failed (according to our statistics, 90% of LUN resets would succeed)
     ||
     \/
reset single LUN
     ||
     || if multiple LUNs on the host have timed out
     || failed =====================================> perform host reset
     ||                                                      ||
     ||                                                      || failed
     ||                                                      ||
     || <==================================================//
     ||
     \/
offline disk

Since it's a lightweight EH, we prefer to offline the disk once the LUN reset fails.
These changes would not affect the original EH flow. The advantage of this design is that it would not
affect other LUNs on the same host.
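
To make the flow above easier to discuss, here is a rough userspace sketch of it in C. It is not kernel code and does not use any existing SCSI midlayer API: struct lw_lun, the lw_* functions, the three recovery hooks and the LW_EH_PENDING return value are all hypothetical names made up for illustration. In particular, the sketch returns LW_EH_PENDING instead of sleeping until the failed command count equals the busy command count, and it assumes host reset is only attempted when multiple LUNs on the host have timed out.

```c
#include <stdbool.h>

/* Hypothetical result of one pass through the lightweight flow. */
enum lw_eh_result {
    LW_EH_RECOVERED, /* some recovery step succeeded */
    LW_EH_PENDING,   /* I/O is stopped, but the LUN is not yet quiesced */
    LW_EH_OFFLINED,  /* every step failed, the disk was set offline */
};

/* Hypothetical per-LUN state; none of these fields are kernel fields. */
struct lw_lun {
    int busy;               /* commands outstanding on this LUN */
    int failed;             /* commands that have failed or timed out */
    bool io_blocked;        /* new I/O to this LUN is stopped */
    bool multi_lun_timeout; /* other LUNs on the same host timed out too */
    bool offline;           /* the disk has been set offline */
    /* Recovery hooks supplied by the caller; each returns true on success. */
    bool (*abort_cmd)(struct lw_lun *lun);
    bool (*lun_reset)(struct lw_lun *lun);
    bool (*host_reset)(struct lw_lun *lun);
};

/* Trivial hooks so the flow can be exercised without real hardware. */
bool lw_hook_ok(struct lw_lun *lun)   { (void)lun; return true; }
bool lw_hook_fail(struct lw_lun *lun) { (void)lun; return false; }

/* The LUN is quiesced once every outstanding command has failed. */
bool lw_lun_quiesced(const struct lw_lun *lun)
{
    return lun->failed == lun->busy;
}

enum lw_eh_result lw_eh_handle_timeout(struct lw_lun *lun)
{
    /* Step 1: try to abort the timed-out command. */
    if (lun->abort_cmd(lun))
        return LW_EH_RECOVERED;

    /* Step 2: stop I/O on this LUN only; other LUNs keep running. */
    lun->io_blocked = true;
    if (!lw_lun_quiesced(lun))
        return LW_EH_PENDING; /* caller retries once more commands fail */

    /* Step 3: reset this single LUN. */
    if (lun->lun_reset(lun)) {
        lun->io_blocked = false;
        return LW_EH_RECOVERED;
    }

    /* Step 4: host reset, only if multiple LUNs on the host timed out. */
    if (lun->multi_lun_timeout && lun->host_reset(lun)) {
        lun->io_blocked = false;
        return LW_EH_RECOVERED;
    }

    /* Step 5: nothing worked, offline the disk. */
    lun->offline = true;
    return LW_EH_OFFLINED;
}
```

With abort_cmd set to lw_hook_fail and lun_reset set to lw_hook_ok on a quiesced LUN, the function returns LW_EH_RECOVERED without touching any host-wide state, which is the case we expect to hit most often.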