Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

From: Shaun Tancheff
Date: Mon Aug 22 2016 - 20:23:07 EST


On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal <damien.lemoal@xxxxxxxx> wrote:
>
> Shaun,
>
> On 8/22/16 13:31, Shaun Tancheff wrote:
> [...]
>> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>> - sector_t sector, unsigned int num_sectors)
>> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>> {
>> - struct blk_zone *zone;
>> + struct request *rq = cmd->request;
>> + struct scsi_device *sdp = cmd->device;
>> + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>> + sector_t sector = blk_rq_pos(rq);
>> + unsigned int nr_sectors = blk_rq_sectors(rq);
>> int ret = BLKPREP_OK;
>> + struct blk_zone *zone;
>> unsigned long flags;
>> + u32 wp_offset;
>> + bool use_write_same = false;
>>
>> zone = blk_lookup_zone(rq->q, sector);
>> - if (!zone)
>> + if (!zone) {
>> + /* Test for a runt zone before giving up */
>> + if (sdp->type != TYPE_ZBC) {
>> + struct request_queue *q = rq->q;
>> + struct rb_node *node;
>> +
>> + node = rb_last(&q->zones);
>> + if (node)
>> + zone = rb_entry(node, struct blk_zone, node);
>> + if (zone) {
>> + spin_lock_irqsave(&zone->lock, flags);
>> + if ((zone->start + zone->len) <= sector)
>> + goto out;
>> + spin_unlock_irqrestore(&zone->lock, flags);
>> + zone = NULL;
>> + }
>> + }
>> return BLKPREP_KILL;
>> + }
>
> I do not understand the point of this code here to test for the runt
> zone. As long as sector is within the device maximum capacity (in 512B
> unit), blk_lookup_zone will return the pointer to the zone structure
> containing that sector (the RB-tree does not have any constraint
> regarding zone size). The only case where NULL would be returned is if
> discard is issued super early right after the disk is probed and before
> the zone refresh work has completed. We can certainly protect against
> that by delaying the discard.

As you can see I am not including Host Managed in the
runt check.

Also you may note that in my patch to get Host Aware working
with the zone cache I do not include the runt zone in the cache.
So as it sits I need this fallback otherwise doing blkdiscard over
the whole device ends in a error, as well as mkfs.f2fs et. al.

>> spin_lock_irqsave(&zone->lock, flags);
>> -
>> if (zone->state == BLK_ZONE_UNKNOWN ||
>> zone->state == BLK_ZONE_BUSY) {
>> sd_zbc_debug_ratelimit(sdkp,
>> - "Discarding zone %zu state %x, deferring\n",
>> + "Discarding zone %zx state %x, deferring\n",
>
> Sector values are usually displayed in decimal. Why use Hex here ? At
> least "0x" would be needed to avoid confusion I think.

Yeah, my brain is lazy about converting very large
numbers to powers of 2. So it's much easier to spot
zone alignment here.



>> zone->start, zone->state);
>> ret = BLKPREP_DEFER;
>> goto out;
>> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>> if (zone->state == BLK_ZONE_OFFLINE) {
>> /* let the drive fail the command */
>> sd_zbc_debug_ratelimit(sdkp,
>> - "Discarding offline zone %zu\n",
>> + "Discarding offline zone %zx\n",
>> zone->start);
>> goto out;
>> }
>> -
>> - if (!blk_zone_is_smr(zone)) {
>> + if (blk_zone_is_cmr(zone)) {
>> + use_write_same = true;
>> sd_zbc_debug_ratelimit(sdkp,
>> - "Discarding %s zone %zu\n",
>> - blk_zone_is_cmr(zone) ? "CMR" : "unknown",
>> + "Discarding CMR zone %zx\n",
>> zone->start);
>> - ret = BLKPREP_DONE;
>> goto out;
>> }
>
> Some 10TB host managed disks out there have 1% conventional zone space,
> that is 100GB of capacity. When issuing a "reset all", doing a write
> same in these zones will take forever... If the user really wants zeroes
> in those zones, let it issue a zeroout.
>
> I think that it would a better choice to simply not report
> discard_zeroes_data as true and do nothing for conventional zones reset.

I think that would be unfortunate for Host Managed but I think it's
the right choice for Host Aware at this time. So either we base
it on disk type or we have some other config flag added to sysfs.

>> - if (blk_zone_is_empty(zone)) {
>> - sd_zbc_debug_ratelimit(sdkp,
>> - "Discarding empty zone %zu\n",
>> - zone->start);
>> - ret = BLKPREP_DONE;
>> + if (zone->start != sector || zone->len < nr_sectors) {
>> + sd_printk(KERN_ERR, sdkp,
>> + "Misaligned RESET WP %zx/%x on zone %zx/%zx\n",
>> + sector, nr_sectors, zone->start, zone->len);
>> + ret = BLKPREP_KILL;
>> goto out;
>> }
>> -
>> - if (zone->start != sector ||
>> - zone->len < num_sectors) {
>> + /* Protect against Reset WP when more data had been written to the
>> + * zone than is being discarded.
>> + */
>> + wp_offset = zone->wp - zone->start;
>> + if (wp_offset > nr_sectors) {
>> sd_printk(KERN_ERR, sdkp,
>> - "Misaligned RESET WP, start %zu/%zu "
>> - "len %zu/%u\n",
>> - zone->start, sector, zone->len, num_sectors);
>> + "Will Corrupt RESET WP %zx/%x/%x on zone %zx/%zx/%zx\n",
>> + sector, wp_offset, nr_sectors,
>> + zone->start, zone->wp, zone->len);
>> ret = BLKPREP_KILL;
>> goto out;
>> }

---
Shaun Tancheff