Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

From: Jens Axboe
Date: Thu Dec 13 2007 - 15:10:29 EST


On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Jens Axboe wrote:
> >>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>Jens Axboe wrote:
> >>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>Mark Lord wrote:
> >>>>>>Jens Axboe wrote:
> >>>>>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>>>>Matthew Wilcox wrote:
> >>>>>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>>>>>>Problem confirmed. 2.6.23.8 regularly generates segments up to
> >>>>>>>>>>64KB for libata,
> >>>>>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>>>>>>Just a suspicion ... could this be slab vs slub? ie check your
> >>>>>>>>>configs
> >>>>>>>>>are the same / similar between the two kernels.
> >>>>>>>>..
> >>>>>>>>
> >>>>>>>>Mmmm.. a good thought, that one.
> >>>>>>>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>>>>>>
> >>>>>>>>My guess is that something got changed around when Jens
> >>>>>>>>reworked the block layer for 2.6.24.
> >>>>>>>>I'm going to dig around in there now.
> >>>>>>>I didn't rework the block layer for 2.6.24 :-). The core block layer
> >>>>>>>changes since 2.6.23 are:
> >>>>>>>
> >>>>>>>- Support for empty barriers. Not a likely candidate.
> >>>>>>>- Shared tag queue fixes. Totally unlikely.
> >>>>>>>- sg chaining support. Not likely.
> >>>>>>>- The bio changes from Neil. Of the bunch, the most likely suspects
> >>>>>>>in
> >>>>>>>this area, since it changes some of the code involved with merges and
> >>>>>>>blk_rq_map_sg().
> >>>>>>>- Lots of simple stuff, again very unlikely.
> >>>>>>>
> >>>>>>>Anyway, it sounds odd for this to be a block layer problem if you do
> >>>>>>>see
> >>>>>>>occasional segments being merged. So it sounds more like the input
> >>>>>>>data
> >>>>>>>having changed.
> >>>>>>>
> >>>>>>>Why not just bisect it?
> >>>>>>..
> >>>>>>
> >>>>>>Because the early 2.6.24 series failed to boot on this machine
> >>>>>>due to bugs in the block layer -- so the code that caused this
> >>>>>>regression
> >>>>>>is probably in the stuff from before the kernels became usable here.
> >>>>>..
> >>>>>
> >>>>>That sounds more harsh than intended --> the earlier 2.6.24 kernels
> >>>>>(up to
> >>>>>the first couple of -rc* ones failed here because of incompatibilities
> >>>>>between the block/bio changes and libata.
> >>>>>
> >>>>>That's better, I think!
> >>>>No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> >>>>
> >>>>If I were you, I'd just start from the first -rc that booted for you. If
> >>>>THAT has the bug, then we'll think of something else. If you don't get
> >>>>anywhere, I can run some tests tomorrow and see if I can reproduce it
> >>>>here.
> >>>..
> >>>
> >>>I believe that *anyone* can reproduce it, since it's broken long before
> >>>the requests ever get to SCSI or libata. Which also means that *anyone*
> >>>who wants to can bisect it, as well.
> >>>
> >>>I don't do "bisects".
> >>It was just a suggestion on how to narrow it down, do as you see fit.
> >>
> >>>But I will dig a bit more and see if I can find the culprit.
> >>Sure, I'll dig around as well.
> >
> >Just tried something simple. I only see one 12kb segment so far, so not
> >a lot by any stretch. I also DONT see any missed merges signs, so it
> >would appear that the pages in the request are simply not contigious
> >physically.
> >
> >diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
> >index e30b1a4..1e34b6f 100644
> >--- a/block/ll_rw_blk.c
> >+++ b/block/ll_rw_blk.c
> >@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct
> >request *rq,
> > goto new_segment;
> >
> > sg->length += nbytes;
> >+ if (sg->length > 8192)
> >+ printk("sg_len=%d\n", sg->length);
> > } else {
> > new_segment:
> > if (!sg)
> >@@ -1349,6 +1351,8 @@ new_segment:
> > sg = sg_next(sg);
> > }
> >
> >+ if (bvprv && (page_address(bvprv->bv_page) +
> >bvprv->bv_len == page_address(bvec->bv_page)))
> >+ printk("missed merge\n");
> > sg_set_page(sg, bvec->bv_page, nbytes,
> > bvec->bv_offset);
> > nsegs++;
> > }
> >
> ..
>
> Yeah, the first part is similar to my own hack.
>
> For testing, try "dd if=/dev/sda of=/dev/null bs=4096k".
> That *really* should end up using contiguous pages on most systems.
>
> I figured out the git thing, and am now building some in-between kernels to
> try.

OK, it's a vm issue, I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/