Re: [PATCH] block: fix kobject double initialization in add_disk
From: Nilay Shroff
Date: Fri Aug 08 2025 - 05:43:18 EST
On 8/8/25 2:04 PM, Yu Kuai wrote:
> Hi,
>
> 在 2025/08/08 16:09, Nilay Shroff 写道:
>>
>>
>> On 8/8/25 6:18 AM, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/08/07 21:44, Zheng Qixing 写道:
>>>> Hi,
>>>>
>>>>
>>>> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>>>>
>>>>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>>>>> From: Zheng Qixing <zhengqixing@xxxxxxxxxx>
>>>>>>
>>>>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>>>>> due to its two-phase creation process (dm create + dm load). This leads
>>>>>> to kobject double initialization errors when the underlying iSCSI devices
>>>>>> become temporarily unavailable and then reappear.
>>>>>>
>>>>>> However, if the first add_disk() call fails and is retried, the queue_kobj
>>>>>> gets initialized twice, causing:
>>>>>>
>>>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>>>>>> something is seriously wrong.
>>>>>> Call Trace:
>>>>>> <TASK>
>>>>>> dump_stack_lvl+0x5b/0x80
>>>>>> kobject_init.cold+0x43/0x51
>>>>>> blk_register_queue+0x46/0x280
>>>>>> add_disk_fwnode+0xb5/0x280
>>>>>> dm_setup_md_queue+0x194/0x1c0
>>>>>> table_load+0x297/0x2d0
>>>>>> ctl_ioctl+0x2a2/0x480
>>>>>> dm_ctl_ioctl+0xe/0x20
>>>>>> __x64_sys_ioctl+0xc7/0x110
>>>>>> do_syscall_64+0x72/0x390
>>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>>
>>>>>> Fix this by separating kobject initialization from sysfs registration:
>>>>>> - Initialize queue_kobj early during gendisk allocation
>>>>>> - add_disk() only adds the already-initialized kobject to sysfs
>>>>>> - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>>>> - Final cleanup happens when the disk is released
>>>>>>
>>>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>>>>>> Reported-by: Li Lingfeng <lilingfeng3@xxxxxxxxxx>
>>>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@xxxxxxxxxx/
>>>>>> Signed-off-by: Zheng Qixing <zhengqixing@xxxxxxxxxx>
>>>>>> ---
>>>>>> block/blk-sysfs.c | 4 +---
>>>>>> block/blk.h | 1 +
>>>>>> block/genhd.c | 2 ++
>>>>>> 3 files changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>> index 396cded255ea..37d8654faff9 100644
>>>>>> --- a/block/blk-sysfs.c
>>>>>> +++ b/block/blk-sysfs.c
>>>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>>>> /* nothing to do here, all data is associated with the parent gendisk */
>>>>>> }
>>>>>> -static const struct kobj_type blk_queue_ktype = {
>>>>>> +const struct kobj_type blk_queue_ktype = {
>>>>>> .default_groups = blk_queue_attr_groups,
>>>>>> .sysfs_ops = &queue_sysfs_ops,
>>>>>> .release = blk_queue_release,
>>>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>>>> struct request_queue *q = disk->queue;
>>>>>> int ret;
>>>>>> - kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>>>> ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>>>>> if (ret < 0)
>>>>>> goto out_put_queue_kobj;
>>>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
>>>>> where we release/put disk->queue_kobj. That would decrement the kref of
>>>>> disk->queue_kobj and possibly bring it to zero.
>>>>
>>>>
>>>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,
>>>>
>>>> it should return without kobject_del/put().
>>>
>>> Yes, sorry I didn't noticed.
>>>>
>>>>
>>>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.
>>>>
>>>>
>>>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic.
>>>>
>>>>
>>>>> Next time, when we call add_disk() again without invoking kobject_init()
>>>>> (because the initialization is now moved outside add_disk()), the refcount
>>>>> of disk->queue_kobj — which was previously released — would now go for a
>>>>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>>>>
>>>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>>>> elevator_set_none(q);
>>>>>> blk_debugfs_remove(disk);
>>>>>> - kobject_put(&disk->queue_kobj);
>>>>>> }
>>>>> I'm thinking a case where add_disk() fails after the queue is registered.
>>>>> In that case, we call blk_unregister_queue() — which would ideally put()
>>>>> the disk->queue_kobj.
>>>>> But if we skip that put() in blk_unregister_queue() (and that's what we do
>>>>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>>>>> blk_register_queue() complain loudly — since we’re trying to add a kobject
>>>>> that was already added previously?
>>>
>>> This is exactly the problem reported orginally, now is the same
>>> procedures before 2bd85221a625:
>>>
>>> 1) allocate memory: kobject_init
>>> 2) register queue: kobject_add
>>> 3) unregister queue: kobject_del
>>> 4) free memory: kobject_put
>>>
>>> Noted that kobject_add is corresponding to kobject_del, and they don't
>>> grab/release kobject reference. 2) and 3) can be executed multiple
>>> times, the only thing that I noticed is the following uevent:
>>>
>>> kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
>>>
>>> Looks like the uevent is only valid for the first one, if first add_disk
>>> failed and then queue is registered again, there won't be uevent again,
>>> see state_add_uevent_sent.
>>>
>> Why do you think so? We always send the "add" uevent when we register
>> queue and then send "remove" uevent when the queue is unregistered. This
>> behavior should remain intact with this change. The kobj->state_add_uevent_sent
>> is only used for sending "remove" event during automatic cleanup of kobject.
>> It shouldn't have any side effect while we re-register queue or retry
>> add_disk(). Or am I missing something here?
>
> Yes, turns out I misread the code after I tested and found that user
> didn't get the queue uevent. Just dig deeper and realize this is due
> to dev_uevent_filter() from the top devices_kset stop the queue kobj
> uevent.
>
> So, if I don't misread the code now, looks like the kobject_uevent()
> call to queue_kobj is useless.
>
Yes that's correct. And so this patch doesn't alter the queue_kobj uevent behavior and so we're good here.
Thanks,
--Nilay