Crash in elevator_dispatch_fn() (e.g. deadline_dispatch()) whenchanging elevators.
From: Frank Mayhar
Date: Tue Jan 21 2014 - 11:06:25 EST
On Sat, 2014-01-18 at 09:31 -0500, Tejun Heo wrote:
> Hello, Frank.
>
> On Fri, Jan 17, 2014 at 01:59:36PM -0800, Frank Mayhar wrote:
> > After investigation, it's clear that the elevator switch code is trying
> > to quiesce the request queue and sets the bypass flag. Unfortunately,
> > scsi_mode_sense() and friends call blk_execute_rq() directly, which pays
> > no attention to said flag.
>
> Ouch.
>
> > In fact, the bypass flag, as far as I can tell, is only checked in the
> > blk_cgroup.c, in blkg_lookup() and blkg_lookup_create().
>
> Hmmm? IIRC the main place is in __get_request(). If the queue is
> bypassing, the request doesn't get REQ_ELVPRIV which basically
> indicates that the request shouldn't go through elevator. I don't
> think that part is broken even for scsi_execute_rq() - it uses
> blk_get_request() which should handle it properly.
>
> > So my question is: Is this a simple oversight? Should blk_execute_rq()
>
> Yeah, it's a bug. It's not supposed to crash like that.
Clearly. :-)
> > care about the bypass flag? Should it perhaps hold off the I/O until
> > the bypass flag is cleared? Since at that level it has nothing to do
> > with cgroups I kinda don't think so but I'm still trying to get my head
> > around how all this stuff goes together.
>
> The cgroup part isn't really relevant. That's just because cgroup
> also uses bypassing to quiesce the queue when it's changing blkcg
> policies. The thing relevant to the elveator is the check in
> __get_request().
>
> Hmm.... the culprit is that we don't have any check in the dispatch
> path. It doesn't really matter who's calling blk_peek_request(), that
> function is called as part of all request dispatches and should never
> crash. I think the only reason we haven't noticed yet is because
> calling evelator dispatch_fn doesn't crash in most cases even if the
> elevator isn't in fully working condition.
>
> Probably what we need is replacing blk_queue_dying() with
> blk_queue_bypass() test in __elv_next_request() before invoking the
> elevator dispatch function.
Replacing? Or adding to? Is BYPASS always set when DYING is set? (My
guess is not but I haven't done an exhaustive analysis.) So the
relevant code snippet in __elv_next_request() would be:
if (unlikely(blk_queue_dying(q)) ||
unlikely(blk_queue_bypass(q)) ||
!q->elevator->type->ops.elevator_dispatch_fn(q, 0))
return NULL;
> Can you please reply w/ Jens and lkml cc'd with the original
> description and my response? No reason to keep this private.
Sure, doing so. I just wanted to make sure my understanding was correct
before bringing others into it.
> Thanks a lot for the report!
No problem. Thanks for the quick response. Sorry for my
three-day-weekend-delayed reply. :-)
Jens, others, my original email follows:
> Hey, Tejun. I have a question about stuff going on in the block layer
> that's not quite as straightforward as one might wish. We have a
> situation in which we need to use something other than the CFQ I/O
> scheduler so after we create a block device (iSCSI, as it happens) we
> switch it to use the deadline scheduler. Unfortunately we've run into a
> crash when we do this in which a request completion (in some cases) or a
> sync I/O (in others) calls blk_peek_request() which in turn calls
> __elv_next_request() which in its turn calls
> q->elevator->type->ops.elevator_dispatch_fn(), which happens at the
> point of the call be deadline_dispatch_requests(), which promptly falls
> over because we're in the middle of the elevator switch and
> q->elevator->elevator_data is NULL.
>
> After investigation, it's clear that the elevator switch code is trying
> to quiesce the request queue and sets the bypass flag. Unfortunately,
> scsi_mode_sense() and friends call blk_execute_rq() directly, which pays
> no attention to said flag.
>
> In fact, the bypass flag, as far as I can tell, is only checked in the
> blk_cgroup.c, in blkg_lookup() and blkg_lookup_create().
>
> So my question is: Is this a simple oversight? Should blk_execute_rq()
> care about the bypass flag? Should it perhaps hold off the I/O until
> the bypass flag is cleared? Since at that level it has nothing to do
> with cgroups I kinda don't think so but I'm still trying to get my head
> around how all this stuff goes together.
>
> Any hints would be _greatly_ appreciated. Thanks!
--
Frank Mayhar
310-460-4042
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/