Re: regression caused by block: freeze the queue earlier in del_gendisk

From: Dusty Mabe
Date: Mon Sep 12 2022 - 22:36:21 EST




On 9/12/22 21:55, Ming Lei wrote:
> On Mon, Sep 12, 2022 at 09:16:18AM +0200, Christoph Hellwig wrote:
>> On Fri, Sep 09, 2022 at 04:24:40PM +0800, Ming Lei wrote:
>>> On Wed, Sep 07, 2022 at 09:33:24AM +0200, Christoph Hellwig wrote:
>>>> On Thu, Sep 01, 2022 at 03:06:08PM +0800, Ming Lei wrote:
>>>>> It is a bit hard to associate the above commit with reported issue.
>>>>
>>>> So the messages clearly are about something trying to open a device
>>>> that went away at the block layer, but somehow does not get removed
>>>> in time by udev (which seems to be a userspace bug in CoreOS). But
>>>> even with that we really should not hang.
>>>
>>> Xiao Ni provides one script[1] which can reproduce the issue more or less.
>>
>> I've run the reproducer 10000 times on current mainline, and while
>> it prints one of the autoloading messages per run, I've not actually
>> seen any kind of hang.
>
> I can't reproduce the hang either.

I obviously can reproduce the issue with the test in our Fedora CoreOS
test suite. It's part of a framework (i.e. it's not simply some script
you can run), but it is very reproducible, so one can add some
instrumentation to the kernel and feed it through a build/test cycle to
see different results or logs.

I'm willing to share this with other people (maybe a screen share or
some written-down instructions) if anyone is interested.


>
> What I meant is that new raid disk can be added by mdadm after stopping
> the imsm container and raid disk with the autoloading messages printed,
> I understand this behavior isn't correct, but I am not familiar with
> raid enough.
>
> It might be related to the delayed deletion of the gendisk from wq & md
> kobj release handler.
>
> During reboot, if mdadm does this stupid thing without stopping, the hang
> could be caused.
>
> I think the root cause question is why mdadm tries to open/add new raid
> bdev crazily during reboot.
>

Dusty