Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

From: Shaohua Li
Date: Mon Mar 21 2011 - 02:53:18 EST


On Fri, 2011-03-18 at 21:52 +0800, Jens Axboe wrote:
> On 2011-03-18 13:54, Jens Axboe wrote:
> > On 2011-03-18 07:36, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> >>> On 2011-03-17 02:00, Shaohua Li wrote:
> >>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>>>> 2011/1/22 Jens Axboe <jaxboe@xxxxxxxxxxxx>:
> >>>>>>> Signed-off-by: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> >>>>>>> block/elevator.c | 6 +-
> >>>>>>> include/linux/blk_types.h | 2 +
> >>>>>>> include/linux/blkdev.h | 30 ++++
> >>>>>>> include/linux/elevator.h | 1 +
> >>>>>>> include/linux/sched.h | 6 +
> >>>>>>> kernel/exit.c | 1 +
> >>>>>>> kernel/fork.c | 3 +
> >>>>>>> kernel/sched.c | 11 ++-
> >>>>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>>>> index 960f12c..42dbfcc 100644
> >>>>>>> --- a/block/blk-core.c
> >>>>>>> +++ b/block/blk-core.c
> >>>>>>> @@ -27,6 +27,7 @@
> >>>>>>> #include <linux/writeback.h>
> >>>>>>> #include <linux/task_io_accounting_ops.h>
> >>>>>>> #include <linux/fault-inject.h>
> >>>>>>> +#include <linux/list_sort.h>
> >>>>>>>
> >>>>>>> #define CREATE_TRACE_POINTS
> >>>>>>> #include <trace/events/block.h>
> >>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>>>
> >>>>>>> q = container_of(work, struct request_queue, delay_work.work);
> >>>>>>> spin_lock_irq(q->queue_lock);
> >>>>>>> - q->request_fn(q);
> >>>>>>> + __blk_run_queue(q);
> >>>>>>> spin_unlock_irq(q->queue_lock);
> >>>>>>> }
> >>>>>> Hi Jens,
> >>>>>> I have some questions about the per-task plugging. The request
> >>>>>> list is per-task, and each task delivers its requests when it
> >>>>>> finishes its plug or is scheduled out. But when one cpu delivers
> >>>>>> requests to the global queue, other cpus don't know, which seems
> >>>>>> problematic. For example:
> >>>>>> 1. get_request_wait() can only flush the current task's request
> >>>>>> list; other cpus/tasks might still have a lot of requests that
> >>>>>> haven't been sent to the request_queue.
> >>>>>
> >>>>> But these requests will be sent to the request queue very soon, as
> >>>>> soon as the task is either scheduled out or explicitly flushes the
> >>>>> plug. So we might wait a bit longer, but that might not matter in
> >>>>> general, I guess.
> >>>> Yes, I understand it is just a bit of delay. I don't know how severe
> >>>> it is, but it could still be a problem, especially for fast storage
> >>>> or random I/O. My current tests show a slight regression (3% or so)
> >>>> with Jens's for-2.6.39/core branch. I'm still checking whether it's
> >>>> caused by the per-task plug, but the per-task plug is highly
> >>>> suspect.
> >>>
> >>> To check this particular case, you can always just bump the request
> >>> limit. What test is showing a slowdown?
> >> This is a simple multi-threaded sequential read. The issue appears to
> >> be request-merge related (not verified yet). Merging drops by about
> >> 60% with the stack plug, according to fio-reported data. From the
> >> trace, without the stack plug, requests from different threads get
> >> merged; with it, such merging is impossible because flush_plug doesn't
> >> check for merges. I think we need to add that back.
> >
> > What we could try is have the plug flush insert be
> > ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
> >
> > Here's a quick hack that does that, I have not tested it at all.
>
> Gave it a quick test spin, as suspected it had a few issues. This one
> seems to work. Can you toss it through that workload and see if it fares
> better?
Yes, this fully resolves the regression I saw. But I have accounting
issues:
1. The merged request is already accounted for when it's added to the
plug list.
2. drive_stat_acct() is called without any protection in
__make_request(), so there is a race in the in_flight accounting. The
race appeared when the stack plug was added, so it's not caused by this
merge change.
Below is the extra patch I needed to run the test.

---
block/blk-merge.c | 12 +++++-------
block/elevator.c | 9 ++++++---
drivers/md/dm.c | 7 ++++---
fs/partitions/check.c | 3 ++-
include/linux/genhd.h | 12 ++++++------
5 files changed, 23 insertions(+), 20 deletions(-)

Index: linux-2.6/block/blk-merge.c
===================================================================
--- linux-2.6.orig/block/blk-merge.c
+++ linux-2.6/block/blk-merge.c
@@ -429,14 +429,12 @@ static int attempt_merge(struct request_

req->__data_len += blk_rq_bytes(next);

- if (next->cmd_flags & REQ_SORTED) {
- elv_merge_requests(q, req, next);
+ elv_merge_requests(q, req, next);

- /*
- * 'next' is going away, so update stats accordingly
- */
- blk_account_io_merge(next);
- }
+ /*
+ * 'next' is going away, so update stats accordingly
+ */
+ blk_account_io_merge(next);

req->ioprio = ioprio_best(req->ioprio, next->ioprio);
if (blk_rq_cpu_valid(next))
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -566,13 +566,16 @@ void elv_merge_requests(struct request_q
{
struct elevator_queue *e = q->elevator;

- if (e->ops->elevator_merge_req_fn)
+ if ((next->cmd_flags & REQ_SORTED) && e->ops->elevator_merge_req_fn)
e->ops->elevator_merge_req_fn(q, rq, next);

elv_rqhash_reposition(q, rq);
- elv_rqhash_del(q, next);

- q->nr_sorted--;
+ if (next->cmd_flags & REQ_SORTED) {
+ elv_rqhash_del(q, next);
+ q->nr_sorted--;
+ }
+
q->last_merge = rq;
}

Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -477,7 +477,8 @@ static void start_io_acct(struct dm_io *
cpu = part_stat_lock();
part_round_stats(cpu, &dm_disk(md)->part0);
part_stat_unlock();
- dm_disk(md)->part0.in_flight[rw] = atomic_inc_return(&md->pending[rw]);
+ atomic_set(&dm_disk(md)->part0.in_flight[rw],
+ atomic_inc_return(&md->pending[rw]));
}

static void end_io_acct(struct dm_io *io)
@@ -497,8 +498,8 @@ static void end_io_acct(struct dm_io *io
* After this is decremented the bio must not be touched if it is
* a flush.
*/
- dm_disk(md)->part0.in_flight[rw] = pending =
- atomic_dec_return(&md->pending[rw]);
+ pending = atomic_dec_return(&md->pending[rw]);
+ atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
pending += atomic_read(&md->pending[rw^0x1]);

/* nudge anyone waiting on suspend queue */
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c
+++ linux-2.6/fs/partitions/check.c
@@ -290,7 +290,8 @@ ssize_t part_inflight_show(struct device
{
struct hd_struct *p = dev_to_part(dev);

- return sprintf(buf, "%8u %8u\n", p->in_flight[0], p->in_flight[1]);
+ return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+ atomic_read(&p->in_flight[1]));
}

#ifdef CONFIG_FAIL_MAKE_REQUEST
Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h
+++ linux-2.6/include/linux/genhd.h
@@ -109,7 +109,7 @@ struct hd_struct {
int make_it_fail;
#endif
unsigned long stamp;
- int in_flight[2];
+ atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
@@ -370,21 +370,21 @@ static inline void free_part_stats(struc

static inline void part_inc_in_flight(struct hd_struct *part, int rw)
{
- part->in_flight[rw]++;
+ atomic_inc(&part->in_flight[rw]);
if (part->partno)
- part_to_disk(part)->part0.in_flight[rw]++;
+ atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
}

static inline void part_dec_in_flight(struct hd_struct *part, int rw)
{
- part->in_flight[rw]--;
+ atomic_dec(&part->in_flight[rw]);
if (part->partno)
- part_to_disk(part)->part0.in_flight[rw]--;
+ atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
}

static inline int part_in_flight(struct hd_struct *part)
{
- return part->in_flight[0] + part->in_flight[1];
+ return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
}

static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)

