Re: [PATCH v13 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes

From: Mike Snitzer
Date: Wed Sep 14 2022 - 12:53:06 EST


On Mon, Sep 12 2022 at 4:22P -0400,
Pankaj Raghav <p.raghav@xxxxxxxxxxx> wrote:

> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0 3M 6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0 3M 4M 7M 8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'
>
> The target does not support partial mapping of the underlying
> device as there is no use-case for it.
>
> Note:
> This target is not related to dm-zoned target, which exposes a zoned block
> device as a regular block device without any write constraint.
>
> This target only exposes a different zone size than the underlying device.
> The underlying device's other constraints will be directly exposed to the
> target.
>
> Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> Suggested-by: Damien Le Moal <damien.lemoal@xxxxxxx>
> Suggested-by: Hannes Reinecke <hare@xxxxxxx>
> ---

<snip>

> diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
> new file mode 100644
> index 000000000000..a48955faa978
> --- /dev/null
> +++ b/drivers/md/dm-po2zoned-target.c

<snip>

> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> + struct dm_po2z_target *dmh = NULL;
> + int ret;
> + sector_t zone_size;
> + sector_t dev_capacity;
> +
> + if (argc != 1)
> + return -EINVAL;
> +
> + dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> + if (!dmh)
> + return -ENOMEM;
> +
> + ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> + &dmh->dev);
> + if (ret) {
> + ti->error = "Device lookup failed";
> + kfree(dmh);
> + return ret;
> + }
> +
> + if (!bdev_is_zoned(dmh->dev->bdev)) {
> + DMERR("%pg is not a zoned device", dmh->dev->bdev);
> + kfree(dmh);
> + return -EINVAL;
> + }
> +
> + zone_size = bdev_zone_sectors(dmh->dev->bdev);
> + dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> + if (ti->len != dev_capacity) {
> + DMERR("%pg Partial mapping of the target is not supported",
> + dmh->dev->bdev);
> + kfree(dmh);
> + return -EINVAL;
> + }
> +
> + if (is_power_of_2(zone_size))
> + DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
> + dmh->dev->bdev);
> +
> + dmh->zone_size = zone_size;
> + dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> + dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
> + dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> + ti->private = dmh;
> + ti->max_io_len = dmh->zone_size_po2;
> + dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> + ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> +
> + return 0;
> +}

The above error paths need to unwind the references or any other
resources acquired before failing. Please see other targets for how
they handle sequencing of the needed operations (e.g. dm_put_device)
in the error path by using gotos, etc.

> +
> +static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
> + void *data)
> +{
> + struct dm_report_zones_args *args = data;
> + struct dm_target *ti = args->tgt;
> + struct dm_po2z_target *dmh = ti->private;
> +
> + zone->start = device_to_target_sect(ti, zone->start);
> + zone->wp = device_to_target_sect(ti, zone->wp);
> + zone->len = dmh->zone_size_po2;
> + args->next_sector = zone->start + zone->len;
> +
> + return args->orig_cb(zone, args->zone_idx++, args->orig_data);
> +}
> +
> +static int dm_po2z_report_zones(struct dm_target *ti,
> + struct dm_report_zones_args *args,
> + unsigned int nr_zones)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + sector_t sect =
> + po2_zone_no(dmh, dm_target_offset(ti, args->next_sector)) *
> + dmh->zone_size;
> +
> + return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
> + dm_po2z_report_zones_cb, args);
> +}
> +
> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
> + blk_status_t *error)
> +{
> + if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
> + bio->bi_iter.bi_sector =
> + device_to_target_sect(ti, bio->bi_iter.bi_sector);
> +
> + return DM_ENDIO_DONE;
> +}
> +
> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> +
> + limits->chunk_sectors = dmh->zone_size_po2;
> +}

Are you certain you shouldn't at least be exposing a different
logical_block_size to upper layers?

> +
> +static void dm_po2z_status(struct dm_target *ti, status_type_t type,
> + unsigned int status_flags, char *result,
> + unsigned int maxlen)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + size_t sz = 0;
> +
> + switch (type) {
> + case STATUSTYPE_INFO:
> + DMEMIT("%s %lld", dmh->dev->name,
> + (unsigned long long)dmh->zone_size_po2);
> + break;

Wouldn't it be worthwhile to expose the zone sectors (native npo2 vs
simulated po2?) You merely roundup but never expose what you're using
(unless I'm missing something about generic "zoned" device
capabilities).

Mike