Re: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

From: Bart Van Assche
Date: Tue Jan 23 2018 - 11:43:53 EST


On Tue, 2018-01-23 at 10:22 +0100, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 5:20pm -0500,
> Bart Van Assche <Bart.VanAssche@xxxxxxx> wrote:
>
> > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > > And yet Laurence cannot reproduce any such lockups with your test...
> >
> > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> > already succeeded at running an unmodified version of my tests. In one of the
> > e-mails Laurence sent me this morning I read that he modified these scripts
> > to get past a kernel module unload failure that was reported while starting
> > these tests. So the next step is to check which changes were made to the test
> > scripts and also whether the test results are still valid.
> >
> > > Are you absolutely certain this patch doesn't help you?
> > > https://patchwork.kernel.org/patch/10174037/
> > >
> > > If it doesn't then that is actually very useful to know.
> >
> > The first I tried this morning is to run the srp-test software against a merge
> > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
> > Since even that was not sufficient I tried to kick the queues via debugfs (for
> > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> > not sufficient to resolve the queue stall I reverted the following tree patches
> > that are in Jens' tree:
> > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
> > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
> > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
> >
> > Only after I had done this the srp-test software ran again without triggering
> > dm queue lockups.

Hello Mike,

I want to take back what I wrote about the reverts. I have not yet tried to
analyze exactly which change made blk_insert_cloned_request() work reliably
for me but with Jens' latest for-next branch and your for-4.16 branch merged
all I need to avoid queue stalls is the following:

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f16096af879a..c59c59cfd2a5 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
+ blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}

Bart.