Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

From: NeilBrown
Date: Tue Apr 04 2017 - 17:53:05 EST


On Tue, Apr 04 2017, Michael Wang wrote:

> On 04/04/2017 11:37 AM, NeilBrown wrote:
>> On Tue, Apr 04 2017, Michael Wang wrote:
> [snip]
>>>>
>>>> If sync_request_write() is using a bio that has already been used, it
>>>> should call bio_reset() and fill in the details again.
>>>> However I don't see how that would happen.
>>>> Can you give specific details on the situation that triggers the bug?
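(For reference, the reuse pattern I mean looks roughly like the sketch
below; it is not the md code itself, and the device/sector values are
only placeholders.)

    /* minimal sketch of reusing a completed bio with v4.4-era APIs:
     * bio_reset() clears the bio state, including bi_next, and the
     * caller refills the fields it needs (data pages etc. omitted) */
    bio_reset(bio);
    bio->bi_bdev           = rdev->bdev;   /* placeholder target device */
    bio->bi_iter.bi_sector = sector;       /* placeholder start sector  */
    bio->bi_rw             = WRITE;
    bio->bi_end_io         = end_sync_write;
    bio->bi_private        = r1_bio;
    generic_make_request(bio);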
>>>
>>> We have storage side mapping lv through scst to server, on server side
>>> we assemble them into multipath device, and then assemble these dm into
>>> two raid1.
>>>
>>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>>> side we unmap all the lv (could during mkfs or fio), then on server side
>>> we hit the BUG (reproducible).
>>
>> So I assume the initial resync is still happening at this point?
>> And you unmap *all* the lv's so you expect IO to fail?
>> I can see that the code would behave strangely if you have a
>> bad-block-list configured (which is the default).
>> Do you have a bbl? If you create the array without the bbl, does it
>> still crash?
>
> The resync is at least happening concurrently in this case. We try
> to simulate the case where all the connections are dropped; the IO does
> fail, along with a bunch of kernel log messages like:
>
> md: super_written gets error=-5
> blk_update_request: I/O error, dev dm-3, sector 64184
> md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848
>
> we expect that to happen, but the server should not crash with a BUG.
>
> And we haven't done anything special regarding the bbl; the bad_blocks
> files in sysfs are all empty.
>
>>
>>>
>>> The path of the bio was confirmed by adding tracing: it is reused in sync_request_write()
>>> with 'bi_next' still chained from inside blk_attempt_plug_merge().
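(For anyone following along: the chaining referred to here is the
back-merge step, roughly as below; this is a paraphrase of the idea in
block/blk-core.c, not a verbatim quote.)

    /* when a bio is merged onto a plugged request it is appended to the
     * request's bio list via bi_next, and nothing clears that link when
     * the bio later completes as part of the request */
    req->biotail->bi_next = bio;
    req->biotail = bio;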
>>
>> I still don't see why it is re-used.
>> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
>> to .../md/sync_action at all?). In that case MD_RECOVERY_REQUESTED is
>> not set.
>
> We just unmap the lv on the storage side, no operation on the server side.
>
>> So sync_request() sends only one bio to generic_make_request():
>> r1_bio->bios[r1_bio->read_disk];
>>
>> then sync_request_write() *doesn't* send that bio again, but does send
>> all the others.
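In other words the write-out pass skips the disk the data was read from,
roughly like this (a paraphrase of the flow described above, not a copy
of drivers/md/raid1.c):

    for (i = 0; i < disks; i++) {
            struct bio *wbio = r1_bio->bios[i];

            if (i == r1_bio->read_disk)   /* already sent by sync_request() */
                    continue;
            if (wbio->bi_end_io == NULL)  /* disk not participating */
                    continue;

            wbio->bi_rw = WRITE;
            wbio->bi_end_io = end_sync_write;
            generic_make_request(wbio);
    }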
>>
>> So where does it reuse a bio?
>
> If that's the design then it would be strange... the log does show the path
> of that bio going through sync_request(); will do more investigation.
>
>>
>>>
>>> We also tried resetting bi_next inside sync_request_write() before
>>> generic_make_request(), which also works.
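(Presumably that workaround amounts to something like the sketch below,
just before the resubmission in sync_request_write(); the exact placement
is an assumption on my part.)

    /* sketch of the workaround described above: drop any stale chain
     * link left over from an earlier plug merge before resubmitting */
    wbio->bi_next = NULL;
    wbio->bi_rw = WRITE;
    generic_make_request(wbio);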
>>>
>>> The testing was done with 4.4, but we found that upstream also leaves bi_next
>>> chained after the bio is done inside the request, thus we post this RFC.
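(Going by the subject line, the RFC presumably does something along the
lines of the sketch below when the block layer finishes a bio that was
merged into a request; this is the idea only, not the posted patch.)

    /* sketch of the idea in the RFC subject: clear the chain link before
     * completing a bio that was part of a request, so a resubmission of
     * the same bio does not inherit a stale bi_next */
    bio->bi_next = NULL;
    bio_endio(bio);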
>>>
>>> Regarding raid1, we haven't found the place on the path where the bio gets
>>> reset... where is it supposed to be?
>>
>> I'm not sure what you mean.
>> We only reset bios when they are being reused.
>> One place is in process_checks() where bio_reset() is called before
>> filling in all the details.
>>
>>
>> Maybe, in sync_request_write(), before
>>
>>     wbio->bi_rw = WRITE;
>>
>> add something like
>>
>>     if (wbio->bi_next)
>>         printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>                i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
>
> Just triggered it again on 4.4, dmesg looks like:
>
> [ 399.240230] md: super_written gets error=-5
> [ 399.240286] md: super_written gets error=-5
> [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

It is very peculiar for r1_bio->bios[r1_bio->read_disk]->bi_end_io to be
end_sync_write.
I can only see that happening when sync_request_write() assigns it,
and this printk runs before that assignment.
That seems to suggest that sync_request_write() is being performed on
the same r1_bio twice, which is also very peculiar.
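If it really is a second pass, something like the following at the top of
sync_request_write() ought to confirm it (just a debugging sketch; the
placement and message are my assumptions, not tested code):

    /* debugging sketch: a read_disk bio whose bi_end_io is already
     * end_sync_write means sync_request_write() has handled this
     * r1_bio before */
    WARN_ONCE(r1_bio->bios[r1_bio->read_disk]->bi_end_io == end_sync_write,
              "md/raid1: sync_request_write() re-entered for this r1_bio\n");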

I might have to try to reproduce this myself and see what is happening.

Thanks,
NeilBrown
