RE: [PATCH] scsi: storvsc: Fix a panic in the hibernation procedure

From: Dexuan Cui
Date: Thu Apr 23 2020 - 03:04:15 EST


> From: Bart Van Assche <bvanassche@xxxxxxx>
> Sent: Wednesday, April 22, 2020 12:56 PM
> On 4/21/20 11:24 PM, Dexuan Cui wrote:
> > Upon suspend, I suppose the other LLDs can not accept I/O either, then
> > what do they do when some kernel thread still tries to submit I/O? Do
> > they "block" the I/O until resume, or just return an error immediately?
>
> This is my understanding of how other SCSI LLDs handle suspend/resume:
> - The ULP driver, e.g. the SCSI sd driver, implements power management
> support by providing callbacks in struct scsi_driver.gendrv.pm and also
> in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device
> cache and send a STOP command to the device.

It looks the sd suspend callbacks are only for the I/O from the disk, e.g.
from the file system that lives in some partition of some disk.

The panic I'm seeing is not from sd. I think it's from a kernel thread
that tries to detect the status of the SCSI CDROM. This is the snipped
messages (the full version is at https://lkml.org/lkml/2020/4/10/47): here
the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
and later the storvsc LLD's suspend callback is also called, but
sr_block_check_events() can still try to submit SCSI commands to storvsc:

[ 11.668741] sr 0:0:0:1: bus quiesce
[ 11.668804] sd 0:0:0:0: bus quiesce
[ 11.698082] scsi target0:0:0: bus quiesce
[ 11.703296] scsi host0: bus quiesce
[ 11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
[ 11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
[ 11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
[ 11.804996] Workqueue: events_freezable_power_ disk_events_workfn
[ 11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
[ 11.804996] Call Trace:
[ 11.804996] scsi_queue_rq+0x593/0xa10
[ 11.804996] blk_mq_dispatch_rq_list+0x8d/0x510
[ 11.804996] blk_mq_sched_dispatch_requests+0xed/0x170
[ 11.804996] __blk_mq_run_hw_queue+0x55/0x110
[ 11.804996] __blk_mq_delay_run_hw_queue+0x141/0x160
[ 11.804996] blk_mq_sched_insert_request+0xc3/0x170
[ 11.804996] blk_execute_rq+0x4b/0xa0
[ 11.804996] __scsi_execute+0xeb/0x250
[ 11.804996] sr_check_events+0x9f/0x270 [sr_mod]
[ 11.804996] cdrom_check_events+0x1a/0x30 [cdrom]
[ 11.804996] sr_block_check_events+0xcc/0x110 [sr_mod]
[ 11.804996] disk_check_events+0x68/0x160
[ 11.804996] process_one_work+0x20c/0x3d0
[ 11.804996] worker_thread+0x2d/0x3e0
[ 11.804996] kthread+0x10c/0x130
[ 11.804996] ret_from_fork+0x35/0x40

It looks the issue is: scsi_bus_freeze() -> ... -> scsi_dev_type_suspend ->
scsi_device_quiesce() does not guarantee the device is totally quiescent:

/**
* scsi_device_quiesce - Block user issued commands.
* @sdev: scsi device to quiesce.
*
* This works by trying to transition to the SDEV_QUIESCE state
* (which must be a legal transition). When the device is in this
* state, only special requests will be accepted, all others will
* be deferred. Since special requests may also be requeued requests,
* a successful return doesn't guarantee the device will be
* totally quiescent.

I guess the "special requests" may include the "detect the status of the
SCSI CDROM" request from sr_block_check_events().

It looks scsi_device_quiesce() does not freeze the queue and it just puts the
device to the SDEV_QUIESCE state, which is not enough to prevent
sr_block_check_events from submitting SCSI commands.

It looks scsi_host_block() freezes the queue and flushes all the pending I/O,
so it can fix the aforementioned NULL pointer deference panic for me.

> - SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and
> resume callbacks. These callbacks can be used to make the PCIe device
> enter/leave a power saving state. No new SCSI commands should be
> submitted after pci_driver.suspend has been called.
>
> > I had a look at drivers/scsi/xen-scsifront.c. It looks this LLD implements
> > a mechanism of marking the device busy upon suspend, so any I/O will
> > immediately get an error SCSI_MLQUEUE_HOST_BUSY. IMO the
> > disadvantage is: the mechanism of marking the device busy looks
> > complex, and the hot path .queuecommand() has to take the
> > spinlock shost->host_lock, which should affect the performance.
>
> I think the code you are referring to is the code in
> scsifront_suspend(). A pointer to that function is stored in a struct
> xenbus_driver instance. That's another structure than the structures
> mentioned above.
>
> Wouldn't it be better to make sure that any hypervisor suspend
> operations happen after it is guaranteed that no further SCSI commands
> will be submitted such that hypervisor suspend operations do not have to
> deal with SCSI commands submitted during or after the hypervisor suspend
> callback?

I agree with you, but it looks scsi_bus_freeze() doesn't guarantee no
further SCSI commands will be submitted, as I described above. Maybe that
is why scsifront_suspend() has to invent a mechanism to cope with the issue.

> > It looks drivers/scsi/nsp32.c: nsp32_suspend() and
> > drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
> > after suspend. I doubt this is correct.
>
> nsp32_suspend() is a PCI suspend callback. If any SCSI commands would be
> submitted after that callback has started that would mean that the SCSI
> suspend and PCIe suspend operations are called in the wrong order. I do
> not agree that code for suspending SCSI commands should be added in
> nsp32_suspend().
>
> > So it looks to me there is no simple mechanism to handle the scenario
> > here, and I guess that's why the scsi_host_block/unblock APIs are
> > introduced, and actually there is already an user of the APIs:
> > 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").
> >
> > The aacraid patch says: "This has the advantage that the block layer will
> > stop sending I/O to the adapter instead of having the SCSI midlayer
> > requeueing I/O internally". It looks this may imply that using the new
> > APIs is encouraged?
>
> I'm fine with using these new functions in device reset handlers. Using
> these functions in power management handlers seems wrong to me.

It looks you're suggesting the scsi_host_block() in aac_suspend() should
be removed? :-)

> > PS, here storvsc has to destroy and re-construct the I/O queues: the
> > I/O queues are shared memory ringbufers between the guest and the
> > host; in the resume path of the hibernation procedure, the memory
> > pages allocated by the 'new' kernel is different from that allocated by
> > the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
> > must destroy the mapping of the pages, and later after we jump to
> > the 'old' kernel, we'll re-create the mapping using the pages allocated
> > by the 'old' kernel. Here "create the mapping" means the guest tells
> > the host about the physical addresses of the pages.
>
> Thank you for having clarified this. This helps me to understand the HV
> driver framework better. I think this means that the hv_driver.suspend
> function should be called at a later time than SCSI suspend. From

I agree, but it looks here scsi_bus_freeze() doesn't work as we expected?

> Documentation/driver-api/device_link.rst: "By default, the driver core
> only enforces dependencies between devices that are borne out of a
> parent/child relationship within the device hierarchy: When suspending,
> resuming or shutting down the system, devices are ordered based on this
> relationship, i.e. children are always suspended before their parent,
> and the parent is always resumed before its children." Is there a single
> storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has

Yes.

> it been considered to make storvsc_drv the parent device of all SCSI
> devices created by the storvsc driver?
>
> Thanks,
>
> Bart.

Yes, I think so:

root@localhost:~# ls -rtl /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd

Here the driver of /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943
is storvsc, which creates host3/target3:0:0/3:0:0:0.

So it looks there is no ordering issue.

Thanks,
-- Dexuan