Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

From: Jens Axboe
Date: Wed Feb 08 2017 - 12:44:37 EST


On 02/08/2017 03:48 AM, Dexuan Cui wrote:
>> From: Jens Axboe [mailto:axboe@xxxxxxxxx]
>> Sent: Wednesday, February 8, 2017 00:09
>> To: Dexuan Cui <decui@xxxxxxxxxxxxx>; Bart Van Assche
>> <Bart.VanAssche@xxxxxxxxxxx>; hare@xxxxxxxx; hare@xxxxxxx
>> Cc: hch@xxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx;
>> jth@xxxxxxxxxx
>> Subject: Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue
>> elements
>>
>> On 02/06/2017 11:29 PM, Dexuan Cui wrote:
>>>> From: linux-block-owner@xxxxxxxxxxxxxxx [mailto:linux-block-
>>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Dexuan Cui
>>>> with the linux-next kernel.
>>>>
>>>> I can boot the guest with linux-next's next-20170130 without any issue,
>>>> but since next-20170131 I haven't succeeded in booting the guest.
>>>>
>>>> With next-20170203 (mentioned in my mail last Friday), I got the same
>>>> calltrace as Hannes.
>>>>
>>>> With today's linux-next (next-20170206), actually the calltrace changed to
>>>> the below.
>>>> [ 122.023036] ? remove_wait_queue+0x70/0x70
>>>> [ 122.051383] async_synchronize_full+0x17/0x20
>>>> [ 122.076925] do_init_module+0xc1/0x1f9
>>>> [ 122.097530] load_module+0x24bc/0x2980
>>>
>>> I don't know why it hangs here, but this is the same calltrace as in my
>>> last-Friday mail, which contains 2 calltraces. It looks like the other calltrace has
>>> been resolved by some changes between next-20170203 and today.
>>>
>>> Here the kernel is trying to load the Hyper-V storage driver (hv_storvsc), and
>>> the driver's __init and .probe have finished successfully and then the kernel
>>> hangs here.
>>>
>>> I believe something was broken recently, because I didn't have any issue before
>>> Jan 31.
>>
>> Can you try and bisect it?
>>
>> Jens Axboe
>
> I bisected it on the branch for-4.11/next of the linux-block repo and the log shows
> the first bad commit is
> [e9c787e6] scsi: allocate scsi_cmnd structures as part of struct request
>
> # git bisect log
> git bisect start
> # bad: [80c6b15732f0d8830032149cbcbc8d67e074b5e8] blk-mq-sched: (un)register elevator when (un)registering queue
> git bisect bad 80c6b15732f0d8830032149cbcbc8d67e074b5e8
> # good: [309bd96af9e26da3038661bf5cdad780eef49dd9] md: cleanup bio op / flags handling in raid1_write_request
> git bisect good 309bd96af9e26da3038661bf5cdad780eef49dd9
> # bad: [27410a8927fb89bd150de08d749a8ed7f67b7739] nbd: remove REQ_TYPE_DRV_PRIV leftovers
> git bisect bad 27410a8927fb89bd150de08d749a8ed7f67b7739
> # bad: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate scsi_cmnd structures as part of struct request
> git bisect bad e9c787e65c0c36529745be47d490d998b4b6e589
> # good: [3278255741326b6d66d8ca7d1cb2c57633ee43d9] scsi_dh_rdac: switch to scsi_execute_req_flags()
> git bisect good 3278255741326b6d66d8ca7d1cb2c57633ee43d9
> # good: [0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7] scsi: remove gfp_flags member in scsi_host_cmd_pool
> git bisect good 0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7
> # good: [eeff68c5618c8d0920b14533c70b2df007bd94b4] scsi: remove scsi_cmd_dma_pool
> git bisect good eeff68c5618c8d0920b14533c70b2df007bd94b4
> # good: [d48777a633d6fa7ccde0f0e6509f0c01fbfc5299] scsi: remove __scsi_alloc_queue
> git bisect good d48777a633d6fa7ccde0f0e6509f0c01fbfc5299
> # first bad commit: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate scsi_cmnd structures as part of struct request

Christoph?

I've changed the subject line; this issue has nothing to do with the
one that Hannes was attempting to fix.

--
Jens Axboe