Re: [PATCH] block: fix kobject double initialization in add_disk

From: Yu Kuai
Date: Fri Aug 08 2025 - 04:34:31 EST


Hi,

在 2025/08/08 16:09, Nilay Shroff 写道:


On 8/8/25 6:18 AM, Yu Kuai wrote:
Hi,

在 2025/08/07 21:44, Zheng Qixing 写道:
Hi,


在 2025/8/7 19:47, Nilay Shroff 写道:

On 8/7/25 12:50 PM, Zheng Qixing wrote:
From: Zheng Qixing <zhengqixing@xxxxxxxxxx>

Device-mapper can call add_disk() multiple times for the same gendisk
due to its two-phase creation process (dm create + dm load). This leads
to kobject double initialization errors when the underlying iSCSI devices
become temporarily unavailable and then reappear.

However, if the first add_disk() call fails and is retried, the queue_kobj
gets initialized twice, causing:

kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
something is seriously wrong.
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5b/0x80
   kobject_init.cold+0x43/0x51
   blk_register_queue+0x46/0x280
   add_disk_fwnode+0xb5/0x280
   dm_setup_md_queue+0x194/0x1c0
   table_load+0x297/0x2d0
   ctl_ioctl+0x2a2/0x480
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0xc7/0x110
   do_syscall_64+0x72/0x390
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration:
  - Initialize queue_kobj early during gendisk allocation
  - add_disk() only adds the already-initialized kobject to sysfs
  - del_gendisk() removes from sysfs but doesn't destroy the kobject
  - Final cleanup happens when the disk is released

Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Reported-by: Li Lingfeng <lilingfeng3@xxxxxxxxxx>
Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@xxxxxxxxxx/
Signed-off-by: Zheng Qixing <zhengqixing@xxxxxxxxxx>
---
  block/blk-sysfs.c | 4 +---
  block/blk.h       | 1 +
  block/genhd.c     | 2 ++
  3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 396cded255ea..37d8654faff9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
      /* nothing to do here, all data is associated with the parent gendisk */
  }
-static const struct kobj_type blk_queue_ktype = {
+const struct kobj_type blk_queue_ktype = {
      .default_groups = blk_queue_attr_groups,
      .sysfs_ops    = &queue_sysfs_ops,
      .release    = blk_queue_release,
@@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
      struct request_queue *q = disk->queue;
      int ret;
-    kobject_init(&disk->queue_kobj, &blk_queue_ktype);
      ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
      if (ret < 0)
          goto out_put_queue_kobj;
If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
where we release/put disk->queue_kobj. That would decrement the kref of
disk->queue_kobj and possibly bring it to zero.


Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,

it should return without kobject_del/put().

Yes, sorry I didn't noticed.


If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.


The current error handling with kobject_put() in blk_register_queue() is indeed problematic.


Next time, when we call add_disk() again without invoking kobject_init()
(because the initialization is now moved outside add_disk()), the refcount
of disk->queue_kobj — which was previously released — would now go for a
toss. Wouldn't that lead to use-after-free or inconsistent state?

@@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
          elevator_set_none(q);
      blk_debugfs_remove(disk);
-    kobject_put(&disk->queue_kobj);
  }
I'm thinking a case where add_disk() fails after the queue is registered.
In that case, we call blk_unregister_queue() — which would ideally put()
the disk->queue_kobj.
But if we skip that put() in blk_unregister_queue() (and that's what we do
above), and then later retry add_disk(), wouldn’t kobject_add() from
blk_register_queue() complain loudly — since we’re trying to add a kobject
that was already added previously?

This is exactly the problem reported orginally, now is the same
procedures before 2bd85221a625:

1) allocate memory: kobject_init
2) register queue: kobject_add
3) unregister queue: kobject_del
4) free memory: kobject_put

Noted that kobject_add is corresponding to kobject_del, and they don't
grab/release kobject reference. 2) and 3) can be executed multiple
times, the only thing that I noticed is the following uevent:

kobject_uevent(&disk->queue_kobj, KOBJ_ADD);

Looks like the uevent is only valid for the first one, if first add_disk
failed and then queue is registered again, there won't be uevent again,
see state_add_uevent_sent.

Why do you think so? We always send the "add" uevent when we register
queue and then send "remove" uevent when the queue is unregistered. This
behavior should remain intact with this change. The kobj->state_add_uevent_sent
is only used for sending "remove" event during automatic cleanup of kobject.
It shouldn't have any side effect while we re-register queue or retry
add_disk(). Or am I missing something here?

Yes, turns out I misread the code after I tested and found that user
didn't get the queue uevent. Just dig deeper and realize this is due
to dev_uevent_filter() from the top devices_kset stop the queue kobj
uevent.

So, if I don't misread the code now, looks like the kobject_uevent()
call to queue_kobj is useless.

Thanks,
Kuai


Thanks,
--Nilay
.