[PATCH] block: always requeue !fs requests at the front

From: Tejun Heo
Date: Fri Jun 15 2007 - 05:43:08 EST


SCSI marks internal commands with REQ_PREEMPT and pushes them at the front
of the request queue using blk_execute_rq(). When entering suspended
or frozen state, SCSI devices are quiesced using
scsi_device_quiesce(). In quiesced state, only REQ_PREEMPT requests
are processed. This is how SCSI blocks other requests out while
suspending and resuming. As all internal commands are pushed at the
front of the queue, this usually works.

Unfortunately, this interacts badly with ordered requeueing. To
preserve request order on requeueing (due to busy device, active EH or
other failures), requests are sorted according to ordered sequence on
requeue if IO barrier is in progress.

The following sequence deadlocks.

1. IO barrier sequence issues.

2. Suspend requested. Queue is quiesced with part or all of the IO
barrier sequence at the front.

3. During suspending or resuming, SCSI issues an internal command which
gets deferred and requeued for some reason. As the command is
issued after the IO barrier in #1, ordered requeueing code puts the
request after IO barrier sequence.

4. The device is ready to process requests again but is still in
quiesced state and the first request of the queue isn't
REQ_PREEMPT, so command processing is deadlocked -
suspending/resuming waits for the issued request to complete while
the request can't be processed till device is put back into
running state by resuming.

This can be fixed by always putting !fs requests at the front when
requeueing.
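
As a rough illustration of the sequence above, here is a small user-space
model. It is a sketch, not kernel code: every name in it (model_rq,
model_req_seq, model_requeue and so on) is hypothetical, the three-level
sequence enum is a simplification of the real QUEUE_ORDSEQ_* values, and
the "quiesced" check only looks at the head of the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified ordering classes (not the kernel's values). */
enum model_seq { MODEL_SEQ_DRAIN, MODEL_SEQ_BARRIER, MODEL_SEQ_DONE };

struct model_rq {
	bool is_fs;            /* models blk_fs_request() being true */
	bool preempt;          /* models REQ_PREEMPT */
	enum model_seq issued; /* position relative to the barrier */
};

/* Models the ordering decision; with_fix applies the rule from this patch. */
static enum model_seq model_req_seq(const struct model_rq *rq, bool with_fix)
{
	if (with_fix && !rq->is_fs)
		return MODEL_SEQ_DRAIN; /* !fs requests always sort to the front */
	return rq->issued;
}

/* Ordered requeue: insert before the first entry that does not sort earlier. */
static size_t model_requeue(struct model_rq *queue, size_t *len,
			    struct model_rq rq, bool with_fix)
{
	enum model_seq seq = model_req_seq(&rq, with_fix);
	size_t pos = 0;

	while (pos < *len && model_req_seq(&queue[pos], with_fix) < seq)
		pos++;
	for (size_t i = *len; i > pos; i--)
		queue[i] = queue[i - 1];
	queue[pos] = rq;
	(*len)++;
	return pos;
}

/* While quiesced, only a preempt request at the head can be dispatched. */
static bool model_can_dispatch(const struct model_rq *queue, size_t len,
			       bool quiesced)
{
	if (len == 0)
		return false;
	return !quiesced || queue[0].preempt;
}

/* Replays steps 1-4: barrier queued, then an internal command is requeued. */
static bool model_suspend_makes_progress(bool with_fix)
{
	struct model_rq queue[4];
	size_t len = 0;
	struct model_rq barrier = { .is_fs = true, .preempt = false,
				    .issued = MODEL_SEQ_BARRIER };
	struct model_rq internal = { .is_fs = false, .preempt = true,
				     .issued = MODEL_SEQ_DONE };

	queue[len++] = barrier;
	model_requeue(queue, &len, internal, with_fix);
	return model_can_dispatch(queue, len, true);
}
```

In this model, model_suspend_makes_progress(false) returns false because
the internal command is requeued behind the barrier request, and
model_suspend_makes_progress(true) returns true because the !fs request
lands at the head of the queue.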

The following thread reports this deadlock.

http://thread.gmane.org/gmane.linux.kernel/537473

Signed-off-by: Tejun Heo <htejun@xxxxxxxxx>
Cc: Jens Axboe <jens.axboe@xxxxxxxxxx>
Cc: David Greaves <david@xxxxxxxxxxxx>
---
Okay, it took a lot of hours of debugging but boiled down to a two-liner
fix. I feel so empty. :-) RAID6 triggers this reliably because it
uses BIO_BARRIER heavily to update its superblock. The recent ATA
suspend/resume rewrite is hit by this because it uses SCSI internal
commands to spin down and up the drives for suspending and resuming.

David, please test this. Jens, does it look okay?

Thanks.

block/ll_rw_blk.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 6b5173a..a2fe2e5 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -340,6 +340,14 @@ unsigned blk_ordered_req_seq(struct request *rq)
 	if (rq == &q->post_flush_rq)
 		return QUEUE_ORDSEQ_POSTFLUSH;
 
+	/* !fs requests don't need to follow barrier ordering. Always
+	 * put them at the front. This fixes the following deadlock.
+	 *
+	 * http://thread.gmane.org/gmane.linux.kernel/537473
+	 */
+	if (!blk_fs_request(rq))
+		return QUEUE_ORDSEQ_DRAIN;
+
 	if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
 	    (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
 		return QUEUE_ORDSEQ_DRAIN;