Re: multi-second application stall in open()

From: Vivek Goyal
Date: Wed Mar 07 2012 - 14:56:43 EST


On Wed, Mar 07, 2012 at 07:56:10PM +0100, Jens Axboe wrote:
[..]

> >
> > The blktrace of CFQ looks odd. I see that some IO (async writes) is being
> > submitted, but CFQ did not dispatch it for a long time. Even after some
> > unplugs came in, still nothing happened. Also, no completions are happening
> > during that window. Not sure why CFQ refuses to dispatch queued writes.
> >
> > Request added by flusher.
> >
> > 8,0 1 36926 5028.546000122 2846 A W 20147012 + 8 <- (8,3) 3375152
> > 8,0 1 36927 5028.546001798 2846 Q W 20147012 + 8 [flush-8:0]
> > 8,0 1 36928 5028.546009900 2846 G W 20147012 + 8 [flush-8:0]
> > 8,0 1 36929 5028.546014649 2846 I W 20147012 + 8 ( 4749) [flush-8:0]
> >
> > And this request is dispatched after 22 seconds.
> >
> > 8,0 1 37056 5050.117337221 162 D W 20147012 + 16 (21571322572) [sync_supers]
> >
> >
> > And it completes fairly fast.
> >
> > 8,0 0 36522 5050.117686149 9657 C W 20147012 + 16 ( 348928) [0]
> >
> > So not sure why CFQ will hold that request for so long when other IO is
> > not happening.
> >
> > Please try latest kernels and see if deadline has the same issue. If not,
> > then we know somehow CFQ is related. If it still happens on latest
> > kernels, can you try capturing blktrace again when you are experiencing
> > the delays.
>
> I'm seeing something very similar here. While testing the gtk fio
> client, I ran a job that issued a lot of random reads to my primary
> disk. 64 ios in flight, direct, libaio, 512b random reads. Firefox
> essentially froze, and windows started freezing up around me.
>
> I'll try and reproduce, but a quick guess would be that things started
> piling up in fsync() or stalling on writes in general, since we are
> heavily starving those.

Quite possible. Other people have also reported write starvation issues. I
have received "hung task timeout of 120 seconds" reports in the presence
of sync IO happening on the same disk/partition.

We probably need to do something about write starvation. I had posted one
patch to make sure we dispatch at least one WRITE after we have been waiting
for pending sync requests to finish.

https://lkml.org/lkml/2011/6/10/326

This might help a bit but might not prevent severe delays in dispatching
async writes, as things are so heavily loaded in favor of sync IO.
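
To make the starvation concrete, here is a toy model. It is not the patch at
the URL above and it is not CFQ code. It assumes one new sync request arrives
every tick and a single async write is already queued; run(), sync_limit and
the tick model are all made up for illustration. The guard shown (force one
write after a fixed streak of sync dispatches) is a simpler stand-in for the
"at least one WRITE" idea, not the same mechanism.

```python
from collections import deque


def run(ticks, sync_limit=None):
    """Return the tick at which the queued write is dispatched, or None.

    Hypothetical model: sync requests are always preferred. If sync_limit
    is set, one async write is forced after that many consecutive sync
    dispatches.
    """
    sync_q = deque()
    async_q = deque(["W0"])  # one async write queued before tick 0
    streak = 0               # consecutive sync dispatches so far

    for tick in range(ticks):
        sync_q.append(f"R{tick}")  # steady sync load: one arrival per tick
        force_write = (
            sync_limit is not None
            and streak >= sync_limit
            and bool(async_q)
        )
        if sync_q and not force_write:
            sync_q.popleft()
            streak += 1
        elif async_q:
            async_q.popleft()
            return tick
    return None


if __name__ == "__main__":
    print("no guard:  ", run(100))
    print("with guard:", run(100, sync_limit=4))
```

Without the guard the write is never dispatched as long as sync requests keep
arriving, which is the shape of the stall being discussed. With the guard the
delay is bounded by the streak length, at the cost of one sync slot.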

BTW, in this case, I did not see any sync IO completions happening while
async was not being dispatched. That's a little odd.
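
For anyone who wants to check their own trace for the same pattern, a minimal
sketch follows. It assumes the default blkparse text layout seen in the quoted
trace (dev, cpu, seq, timestamp, pid, action, rwbs, sector + count) and matches
insert (I) to dispatch (D) events by start sector, which works for the request
above but will miss requests whose start sector changes through a front merge.
The function names are made up for illustration.

```python
import re

# Assumed blkparse default text layout:
#   dev cpu seq timestamp pid action rwbs sector + nsectors ...
# Leading "> " quote markers are tolerated so lines can be pasted from mail.
EVENT = re.compile(
    r"^[>\s]*(\d+,\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.(\d+)\s+(\d+)\s+"
    r"([A-Z]+)\s+(\S+)\s+(\d+)\s+\+\s+(\d+)"
)

SAMPLE = """\
8,0 1 36926 5028.546000122 2846 A W 20147012 + 8 <- (8,3) 3375152
8,0 1 36927 5028.546001798 2846 Q W 20147012 + 8 [flush-8:0]
8,0 1 36928 5028.546009900 2846 G W 20147012 + 8 [flush-8:0]
8,0 1 36929 5028.546014649 2846 I W 20147012 + 8 ( 4749) [flush-8:0]
8,0 1 37056 5050.117337221 162 D W 20147012 + 16 (21571322572) [sync_supers]
8,0 0 36522 5050.117686149 9657 C W 20147012 + 16 ( 348928) [0]
"""


def parse_events(lines):
    """Return (timestamp_ns, action, rwbs, start_sector) per matching line."""
    events = []
    for line in lines:
        m = EVENT.match(line)
        if not m:
            continue
        # Keep timestamps as integer nanoseconds to avoid float rounding.
        ts_ns = int(m.group(4)) * 1_000_000_000 + int(m.group(5).ljust(9, "0")[:9])
        events.append((ts_ns, m.group(7), m.group(8), int(m.group(9))))
    return events


def insert_to_dispatch(events):
    """Map start sector -> (insert_ns, dispatch_ns) for the first I/D pair."""
    inserted, result = {}, {}
    for ts, action, _rwbs, sector in events:
        if action == "I":
            inserted.setdefault(sector, ts)
        elif action == "D" and sector in inserted and sector not in result:
            result[sector] = (inserted[sector], ts)
    return result


def completions_between(events, start_ns, end_ns):
    """Count C events strictly inside the (start_ns, end_ns) window."""
    return sum(
        1 for ts, action, _r, _s in events
        if action == "C" and start_ns < ts < end_ns
    )


if __name__ == "__main__":
    events = parse_events(SAMPLE.splitlines())
    for sector, (ins, disp) in insert_to_dispatch(events).items():
        stalled_s = (disp - ins) / 1e9
        done = completions_between(events, ins, disp)
        print(f"sector {sector}: held {stalled_s:.3f}s, {done} completions meanwhile")
```

On a full trace, a long hold time combined with zero completions in the window
is the odd combination described above.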

Thanks
Vivek