RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

From: Michael Kelley
Date: Tue Mar 09 2021 - 10:58:31 EST


From: John Garry <john.garry@xxxxxxxxxx> Sent: Tuesday, March 9, 2021 2:10 AM
>
> On 08/03/2021 17:56, Melanie Plageman wrote:
> > On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:
> >> From: Melanie Plageman (Microsoft) <melanieplageman@xxxxxxxxx> Sent: Friday,
> March 5, 2021 3:22 PM
> >>>
> >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> >>> allocation.
> >>>
> >>> Cap cmd_per_lun at can_queue to avoid dispatch errors.
> >>>
> >>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@xxxxxxxxx>
> >>> ---
> >>> drivers/scsi/storvsc_drv.c | 2 ++
> >>> 1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> >>> index 6bc5453cea8a..d7953a6e00e6 100644
> >>> --- a/drivers/scsi/storvsc_drv.c
> >>> +++ b/drivers/scsi/storvsc_drv.c
> >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> >>> (max_sub_channels + 1) *
> >>> (100 - ring_avail_percent_lowater) / 100;
> >>>
> >>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,
> scsi_driver.can_queue);
> >>> +
> >>
> >> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate?
> >
> > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
> > Scsi_Host->cmd_per_lun in storvsc_probe().
> >
> > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> > called and sets the scsi_device->queue_depth to the Scsi_Host's
> > cmd_per_lun with this code:
> >
> > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
> > sdev->host->cmd_per_lun : 1);
> >
> > During dispatch, the scsi_device->queue_depth is used in
> > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> > whether or not the device can queue another command.
> >
> > On some machines, with the 2048 value of cmd_per_lun that was used to
> > set the initial scsi_device->queue_depth, commands can be queued that
> > are later not able to be dispatched after running out of space in the
> > ringbuffer.
> >
> > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
> > (running an fio workload that I can provide if needed), storvsc_do_io()
> > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
> >
> > This is the call stack:
> >
> > hv_get_bytes_to_write
> > hv_ringbuffer_write
> > vmbus_send_packet
> > storvsc_dio_io
> > storvsc_queuecommand
> > scsi_dispatch_cmd
> > scsi_queue_rq
> > dispatch_rq_list
> >
> >> Be aware that the calculation of "can_queue" in this driver is somewhat
> >> flawed -- it should not be based on the size of the ring buffer, but instead on
> >> the maximum number of requests Hyper-V will queue. And even then,
> >> can_queue doesn't provide the cap you might expect because the blk-mq layer
> >> allocates can_queue tags for each HW queue, not as a total.
> >
> >
> > The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> >
> > can_queue
> > - must be greater than 0; do not send more than can_queue
> > commands to the adapter.
> >
> > I did notice that in scsi_host.h, the comment for can_queue does say
> > can_queue is the "maximum number of simultaneous commands a single hw
> > queue in HBA will accept." However, I don't see it being used this way
> > in the code.
> >
>
> JFYI, the block layer ensures that no more than can_queue requests are
> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
> depth is set to shost->can_queue.
>
> Thanks,
> John

Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls
blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
set queue_depth as needed until it succeeds.

The key thing is that __blk_mq_alloc_rq_maps() iterates over the
number of HW queues calling __blk_mq_alloc_map_and_request().
The latter function allocates the map and the requests with a count
of the tag set's queue_depth. There's no logic to apportion the
can_queue value across multiple HW queues. So each HW queue gets
can_queue tags allocated, and the SCSI host driver may see up to
(can_queue * # HW queues) simultaneous requests.

I'm certainly not an expert in this area, but that's what I see in the
code. We've run live experiments, and can see the number
simultaneous requests sent to the storvsc driver be greater than
can_queue when the # of HW queues is greater than 1, which seems
to be consistent with the code.

Michael

>
>
> > During dispatch, In scsi_target_queue_ready(), there is this code:
> >
> > if (busy >= starget->can_queue)
> > goto starved;
> >
> > And the scsi_target->can_queue value should be coming from Scsi_host as
> > mentioned in the scsi_target definition in scsi_device.h
> > /*
> > * LLDs should set this in the slave_alloc host template callout.
> > * If set to zero then there is not limit.
> > */
> > unsigned int can_queue;
> >
> > So, I don't really see how this would be per hardware queue.
> >
> >>
> >> I agree that the cmd_per_lun setting is also too big, but we should fix that in
> >> the context of getting all of these different settings working together correctly,
> >> and not piecemeal.
> >>
> >
> > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe
> > will also prevent the LUN queue_depth from being set to a value that is
> > higher than it can ever be set to again by the user when
> > storvsc_change_queue_depth() is invoked.
> >
> > Also in scsi_sysfs sdev_store_queue_depth() there is this check:
> >
> > if (depth < 1 || depth > sdev->host->can_queue)
> > return -EINVAL;
> >
> > I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun
> > is set to the min of the configured cmd_per_lun and
> > Scsi_Host->can_queue:
> >
> > shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
> >
> > Best,
> > Melanie
> > .
> >