[RFC] block: reject I/O in BLKRRPART if block size changed

From: Minwoo Im
Date: Sat Dec 26 2020 - 13:04:24 EST


Background:
Let's say we have 2 LBA format for 4096B and 512B LBA size for a
NVMe namespace. Assume that current LBA format is 4096B and in case
we convert namespace to 512B and 4096B back again:

nvme format /dev/nvme0n1 --lbaf=1 --force # to 512B LBA
nvme format /dev/nvme0n1 --lbaf=0 --force # to 4096B LBA

Then we can see the following errors during the BLKRRPART ioctl from
the nvme-cli format subcommand:

[ 10.771740] blk_update_request: operation not supported error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 10.780262] Buffer I/O error on dev nvme0n1, logical block 0, async page read
...

Also, we can see the Read commands followed by the Format command due
to BLKRRPART ioctl with Number of LBAs to 65535(0xffff) which is
under-flowed because the request for the Read commands are coming with
512B and this is because it's playing around with i_blkbits from the
block_device inode which needs to be avoided as [1].

kworker/0:1H-56 [000] .... 913.456922: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=216, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=65535, ctrl=0x0, dsmgmt=0, reftag=0)
ksoftirqd/0-9 [000] .Ns. 916.566351: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=216, res=0x0, retries=0, flags=0x0, status=0x4002
...

Before we have commit 5ff9f19231a0 ("block: simplify
set_init_blocksize"), block size used to be bumped up to the
4K(PAGE_SIZE) in this example and we have not seen these errors. But
with this patch, we have to make sure that bdev->bd_inode->i_blkbits to
make sure that BLKRRPART ioctl pass proper request length based on the
changed logical block size.

Description:
As the previous discussion [1], this patch introduced a gendisk flag
to indicate that block size has been changed in the runtime. This flag
is set when logical block size is changed in the runtime with sector
capacity itself. It will be cleared when the file descriptor for the
block devie is opened again which means __blkdev_get() updates the block
size via set_init_blocksize().
This patch rejects I/O from the path of add_partitions() and
application should open the file descriptor again to update the block
size of the block device inode.

[1] https://lore.kernel.org/linux-nvme/20201223183143.GB13354@localhost.localdomain/T/#t

Signed-off-by: Minwoo Im <minwoo.im.dev@xxxxxxxxx>
---
block/genhd.c | 3 +++
block/partitions/core.c | 11 +++++++++++
fs/block_dev.c | 6 ++++++
include/linux/genhd.h | 1 +
4 files changed, 21 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index b84b8671e627..1f64907fac3d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -79,6 +79,9 @@ bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
*/
if (!capacity || !size)
return false;
+
+ disk->flags |= GENHD_FL_BLOCK_SIZE_CHANGED;
+
kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
return true;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index deca253583bd..7dfcda96be9e 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -617,6 +617,17 @@ int blk_add_partitions(struct gendisk *disk, struct block_device *bdev)
if (!disk_part_scan_enabled(disk))
return 0;

+ /*
+ * Reject to check partition information if block size has been changed
+ * in the runtime. If block size of a block device has been changed,
+ * the file descriptor should be opened agian to update the blkbits.
+ */
+ if (disk->flags & GENHD_FL_BLOCK_SIZE_CHANGED) {
+ pr_warn("%s: rejecting checking partition. fd should be opened again.\n",
+ disk->disk_name);
+ return -EBADFD;
+ }
+
state = check_partition(disk, bdev);
if (!state)
return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e56ee1f2652..813361ad77c1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -132,6 +132,12 @@ EXPORT_SYMBOL(truncate_bdev_range);
static void set_init_blocksize(struct block_device *bdev)
{
bdev->bd_inode->i_blkbits = blksize_bits(bdev_logical_block_size(bdev));
+
+ /*
+ * Allow I/O commands for this block device. We can say that this
+ * block device has been set to a proper block size.
+ */
+ bdev->bd_disk->flags &= ~GENHD_FL_BLOCK_SIZE_CHANGED;
}

int set_blocksize(struct block_device *bdev, int size)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 809aaa32d53c..0e0e24917003 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -103,6 +103,7 @@ struct partition_meta_info {
#define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE 0x0100
#define GENHD_FL_NO_PART_SCAN 0x0200
#define GENHD_FL_HIDDEN 0x0400
+#define GENHD_FL_BLOCK_SIZE_CHANGED 0x0800

enum {
DISK_EVENT_MEDIA_CHANGE = 1 << 0, /* media changed */
--
2.17.1