RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

From: Michael Kelley
Date: Tue Mar 09 2021 - 12:10:19 EST


From: John Garry <john.garry@xxxxxxxxxx> Sent: Tuesday, March 9, 2021 8:36 AM
>
> On 09/03/2021 15:57, Michael Kelley wrote:
> > From: John Garry <john.garry@xxxxxxxxxx> Sent: Tuesday, March 9, 2021 2:10 AM
> >>
> >> On 08/03/2021 17:56, Melanie Plageman wrote:
> >>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:
> >>>> From: Melanie Plageman (Microsoft) <melanieplageman@xxxxxxxxx> Sent: Friday, March 5, 2021 3:22 PM
> >>>>>
> >>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> >>>>> allocation.
> >>>>>
> >>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors.
> >>>>>
> >>>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@xxxxxxxxx>
> >>>>> ---
> >>>>> drivers/scsi/storvsc_drv.c | 2 ++
> >>>>> 1 file changed, 2 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> >>>>> index 6bc5453cea8a..d7953a6e00e6 100644
> >>>>> --- a/drivers/scsi/storvsc_drv.c
> >>>>> +++ b/drivers/scsi/storvsc_drv.c
> >>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> >>>>> (max_sub_channels + 1) *
> >>>>> (100 - ring_avail_percent_lowater) / 100;
> >>>>>
> >>>>> +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);
> >>>>> +
> >>>>
> >>>> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate?
> >>>
> >>> The scsi_driver.cmd_per_lun is set to 2048, which is then used to set
> >>> Scsi_Host->cmd_per_lun in storvsc_probe().
> >>>
> >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> >>> called and sets the scsi_device->queue_depth to the Scsi_Host's
> >>> cmd_per_lun with this code:
> >>>
> >>> 	scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
> >>> 				sdev->host->cmd_per_lun : 1);
> >>>
> >>> During dispatch, the scsi_device->queue_depth is used in
> >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> >>> whether or not the device can queue another command.
> >>>
> >>> On some machines, with the initial scsi_device->queue_depth set from the
> >>> 2048 value of cmd_per_lun, commands can be queued that later cannot be
> >>> dispatched because the ring buffer runs out of space.
> >>>
> >>> On an 8-core Azure VM with 16 GB of memory and a single 1 TiB SSD
> >>> (running an fio workload that I can provide if needed), storvsc_do_io()
> >>> often ends up returning SCSI_MLQUEUE_DEVICE_BUSY.
> >>>
> >>> This is the call stack:
> >>>
> >>> hv_get_bytes_to_write
> >>> hv_ringbuffer_write
> >>> vmbus_sendpacket
> >>> storvsc_do_io
> >>> storvsc_queuecommand
> >>> scsi_dispatch_cmd
> >>> scsi_queue_rq
> >>> blk_mq_dispatch_rq_list
> >>>
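Just to spell out the budget check Melanie is referring to: stripped of the
device_blocked handling, scsi_dev_queue_ready() boils down to roughly the
following (paraphrased sketch, not the exact mainline code):

	/* sketch: is there budget for one more command on this scsi_device? */
	busy = atomic_inc_return(&sdev->device_busy) - 1;
	if (busy >= sdev->queue_depth) {	/* queue_depth was seeded from cmd_per_lun */
		atomic_dec(&sdev->device_busy);
		return false;			/* no budget; hold the request */
	}
	return true;

With queue_depth starting at 2048, this check essentially never pushes back,
so the first real backpressure is the ring buffer filling up in
storvsc_do_io().
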
> >>>> Be aware that the calculation of "can_queue" in this driver is somewhat
> >>>> flawed -- it should not be based on the size of the ring buffer, but instead on
> >>>> the maximum number of requests Hyper-V will queue. And even then,
> >>>> can_queue doesn't provide the cap you might expect because the blk-mq layer
> >>>> allocates can_queue tags for each HW queue, not as a total.
> >>>
> >>>
> >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> >>>
> >>> can_queue
> >>> - must be greater than 0; do not send more than can_queue
> >>> commands to the adapter.
> >>>
> >>> I did notice that in scsi_host.h, the comment for can_queue does say
> >>> can_queue is the "maximum number of simultaneous commands a single hw
> >>> queue in HBA will accept." However, I don't see it being used this way
> >>> in the code.
> >>>
> >>
> >> JFYI, the block layer ensures that no more than can_queue requests are
> >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
> >> depth is set to shost->can_queue.
> >>
> >> Thanks,
> >> John
> >
> > Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls
> > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
> > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
> > set queue_depth as needed until it succeeds.
> >
> > The key thing is that __blk_mq_alloc_rq_maps() iterates over the
> > number of HW queues calling __blk_mq_alloc_map_and_request().
> > The latter function allocates the map and the requests with a count
> > of the tag set's queue_depth. There's no logic to apportion the
> > can_queue value across multiple HW queues. So each HW queue gets
> > can_queue tags allocated, and the SCSI host driver may see up to
> > (can_queue * # HW queues) simultaneous requests.
> >
> > I'm certainly not an expert in this area, but that's what I see in the
> > code. We've run live experiments and can see the number of simultaneous
> > requests sent to the storvsc driver exceed can_queue when the # of HW
> > queues is greater than 1, which seems to be consistent with the code.
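
To make that concrete with purely illustrative numbers (not what storvsc
actually computes):

	/* illustrative numbers only */
	shost->can_queue    = 1024;	/* intended per-host cap */
	shost->nr_hw_queues = 8;	/* HW queues configured by the driver */

	/*
	 * blk-mq sizes the tag map of each of the 8 HW queues to 1024,
	 * so the driver can see up to 8 * 1024 = 8192 outstanding
	 * requests rather than the 1024 that can_queue suggests.
	 */
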
>
> ah, ok. I assumed that # of HW queues = 1 here. So you're describing a
> problem similar to
> https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@xxxxxxxxxx/

Yes.

>
> So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads:
> the total queue depth per host is nr_hw_queues * can_queue. However, for
> when host_tagset is set, the total queue depth is can_queue.
>
> Setting .host_tagset will ensure at most can_queue requests will be sent
> over all HW queues at any given time. A few SCSI MQ drivers set this now.
>

I had seen the "host_tagset" option in some recent investigations, and it
looks like it better models what the storvsc driver wants. But given that
only a few drivers were using it, it seemed like it might be far enough off the
beaten path to be a bit risky.

The storvsc driver is the driver for the synthetic SCSI controller provided
by Microsoft's Hyper-V hypervisor, and as such it is used heavily in the
Azure public cloud. The settings in storvsc have changed incrementally
over the years, and there are now some inconsistencies as Melanie has
pointed out. Storvsc and the underlying Hyper-V run in VMs with relatively
low IOPS limits and also in VMs that can hit hundreds of K IOPS. Getting
enough parallelism to drive the high IOPS while not overloading in the
low IOPS environments is the challenge. As a purely synthetic device,
there aren't really multiple HW queues, but configuring for multiple
queues helps drive the higher end IOPS numbers in VMs with sizable
vCPU counts.

So setting "host_tagset" while allowing multiple HW queues for
higher parallelism might be a good approach. We will experiment.
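
Concretely, the experiment is little more than flipping the flag in the
host template, something like this (untested sketch):

	static struct scsi_host_template scsi_driver = {
		.module		= THIS_MODULE,
		.name		= "storvsc_host_t",
		/* ... existing fields unchanged ... */
		.host_tagset	= 1,	/* one can_queue-sized tag space shared
					 * across all HW queues */
	};

With host_tagset set, can_queue really does become the per-host cap, per the
scsi_host.h comment John quoted, so we'd also recheck whether the current
can_queue calculation still makes sense.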

Michael