Re: [PATCH 04/10] block: initial patch for on-stack per-taskplugging

From: Shaohua Li
Date: Wed Mar 16 2011 - 21:00:22 EST


On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> > 2011/1/22 Jens Axboe <jaxboe@xxxxxxxxxxxx>:
> > > Signed-off-by: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > ---
> > > block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> > > block/elevator.c | 6 +-
> > > include/linux/blk_types.h | 2 +
> > > include/linux/blkdev.h | 30 ++++
> > > include/linux/elevator.h | 1 +
> > > include/linux/sched.h | 6 +
> > > kernel/exit.c | 1 +
> > > kernel/fork.c | 3 +
> > > kernel/sched.c | 11 ++-
> > > 9 files changed, 317 insertions(+), 100 deletions(-)
> > >
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 960f12c..42dbfcc 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -27,6 +27,7 @@
> > > #include <linux/writeback.h>
> > > #include <linux/task_io_accounting_ops.h>
> > > #include <linux/fault-inject.h>
> > > +#include <linux/list_sort.h>
> > >
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/block.h>
> > > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> > >
> > > q = container_of(work, struct request_queue, delay_work.work);
> > > spin_lock_irq(q->queue_lock);
> > > - q->request_fn(q);
> > > + __blk_run_queue(q);
> > > spin_unlock_irq(q->queue_lock);
> > > }
> > Hi Jens,
> > I have some questions about the per-task plugging. Since the request
> > list is per-task, and each task delivers its requests at finish flush
> > or schedule. But when one cpu delivers requests to global queue, other
> > cpus don't know. This seems to have problem. For example:
> > 1. get_request_wait() can only flush current task's request list,
> > other cpus/tasks might still have a lot of requests, which aren't sent
> > to request_queue.
>
> But very soon these requests will be sent to request queue as soon task
> is either scheduled out or task explicitly flushes the plug? So we might
> wait a bit longer but that might not matter in general, i guess.
Yes, I understand there is just a bit delay. I don't know how severe it
is, but this still could be a problem, especially for fast storage or
random I/O. My current tests show slight regression (3% or so) with
Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
per-task plug, but the per-task plug is highly suspected.

> > your ioc-rq-alloc branch is for this, right? Will it
> > be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> > queue length. If there are enough requests there, we force a flush
> > plug.
>
> That's the idea jens had. But then came the question of maintaining
> data structures per task per disk. That makes it complicated.
>
> Even if we move the accounting out of request queue and do it say at
> bdi, ideally we shall to do per task per bdi accounting.
>
> Jens seemed to be suggesting that generally fluser threads are the
> main cluprit for submitting large amount of IO. They are already per
> bdi. So probably just maintain a per task limit for flusher threads.
Yep, flusher is the main spot in my mind. We need call more flush plug
for flusher thread.

> I am not sure what happens to direct reclaim path, AIO deep queue
> paths etc.
direct reclaim path could build deep write queue too. It
uses .writepage, currently there is no flush plug there. Maybe we need
add flush plug in shrink_inactive_list too.

> > 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> > not work. because other CPUs might not dispatch their requests to
> > request queue. So __blk_run_queue will eventually find no requests,
> > which might stall devices.
> > Since one cpu doesn't know other cpus' request list, I'm wondering if
> > there are other similar issues.
>
> So again in this case if queue is empty at the time of __blk_run_queue(),
> then we will probably just experinece little more delay then intended
> till some task flushes. But should not stall the system?
not stall the system, but device stalls a little time.

Thanks,
Shaohua

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/