Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)

From: Mel Gorman
Date: Wed Jul 02 2008 - 01:18:20 EST


On (01/07/08 13:29), Dan Williams didst pronounce:
>
> On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote:
> > On (01/07/08 18:58), Andy Whitcroft didst pronounce:
> > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> > > > > Other suggestions / experiments?
> > > > >
> > >
> > > Looking at the commit in question (54a6eb5c) there is one slight anomaly
> > > in the conversion. When nr_free_zone_pages() was converted to the new
> > > iterators it started using the offset parameter to limit the zones
> > > traversed; which is not unreasonable as that appears to be the
> > > parameter's purpose. However, if we look at the original implementation
> > > of this function (reproduced below) we can see it actually did nothing
> > > with this parameter:
> > >
> > > static unsigned int nr_free_zone_pages(int offset)
> > > {
> > > 	/* Just pick one node, since fallback list is circular */
> > > 	unsigned int sum = 0;
> > >
> > > 	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
> > > 	struct zone **zonep = zonelist->zones;
> > > 	struct zone *zone;
> > >
> > > 	for (zone = *zonep++; zone; zone = *zonep++) {
> > > 		unsigned long size = zone->present_pages;
> > > 		unsigned long high = zone->pages_high;
> > > 		if (size > high)
> > > 			sum += size - high;
> > > 	}
> > >
> > > 	return sum;
> > > }
> > >
> >
> > This looks kinda promising, but it depends heavily on how this patch was
> > tested in isolation. Dan, can you please post the patch you used on 2.6.25?
> > The commit in question should not have applied cleanly there.
> >
> > To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with
> > the right zones in it. However, with two-zonelist, there is only one that
> > gets filtered so using GFP_KERNEL to find a zone is equivalent as it gets
> > filtered based on offset. However, if this patch was tested in isolation,
> > it could result in bogus values of vm_total_pages. Dan, can you confirm
> > in your dmesg logs that the line like the following has similar values
> > please?
> >
> > Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258544
>
> The system is booted with mem=1024M on the kernel command line and with
> or without Andy's patch this reports:
>
> Built 1 zonelists in Zone order, mobility grouping on. Total pages: 227584
>
> Performance is still sporadic with the change. Moreover this condition
> is reproducing even with CONFIG_NOHIGHMEM=y.
>
> Let us take commit 8b3e6cdc out of the equation and just look at raid0
> performance:
>
> revision      2.6.25.8-fc8  54a6eb5c  54a6eb5c-nohighmem  2.6.26-rc8
>               279           278       273                 277
>               281           278       275                 277
>               281           113       68.7                66.8
>               279           69.2      277                 73.7
>               278           75.6      62.5                80.3
> MB/s (avg)    280           163       191                 155
> % change      0%            -42%      -32%                -45%
> result        base          bad       bad                 bad
>

Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
figures are on par with the "good" kernel, with some disastrous runs leading
to a bad average. The thing is, the bad results are way worse than could be
accounted for by two-zonelist alone. In fact, the figures look suspiciously
like only 1 disk of the 4 is in use, as they are roughly quartered. Can you
think of anything that would explain that? Can you also confirm that a
bisection point before two-zonelist runs steadily and with high performance,
as expected? This is to rule out some other RAID patch being a factor.
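
As a rough check of the "quartered" reading, here is a minimal sketch that
redoes the arithmetic on the figures from the table above. The helper and
array names are made up for illustration, and the 50% cut-off for calling a
run "slow" is an arbitrary assumption, not something measured.

```c
#include <stddef.h>

/* Per-run dd throughput in MB/s, copied from the table above. */
static const double runs_2_6_25[]   = { 279, 281, 281, 279, 278 };
static const double runs_54a6eb5c[] = { 278, 278, 113, 69.2, 75.6 };

/* Arithmetic mean of n throughput samples. */
double mean_mbps(const double *runs, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += runs[i];
	return n ? sum / (double)n : 0.0;
}

/* Fraction of the baseline throughput that a single run achieved. */
double fraction_of_baseline(double run, double baseline)
{
	return run / baseline;
}

/* Number of runs that fell below `threshold` (a fraction) of the baseline. */
size_t count_slow_runs(const double *runs, size_t n,
		       double baseline, double threshold)
{
	size_t i, slow = 0;

	for (i = 0; i < n; i++)
		if (fraction_of_baseline(runs[i], baseline) < threshold)
			slow++;
	return slow;
}
```

Against the 2.6.25.8 mean of about 280 MB/s, the slow runs in the table come
out between roughly 0.22 and 0.40 of baseline, with most of them close to
0.25, which is what you would expect if one disk out of four were doing all
the work.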

It would be worth running vmstat during the tests so we can see if IO
rates are dropping from an overall system perspective. If possible,
oprofile data from the same time would be helpful to see whether it shows
where we are getting stuck.

> These numbers are taken from the results of:
> for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
>
> Where md0 is created by:
> mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
>
> I will try your debug patch next Mel, and then try to collect more data
> with blktrace.
>

I tried reproducing this but I don't have the necessary hardware to even
come close to reproducing your test case :( . I got some dbench results
with oprofile but found no significant differences between 2.6.25 and
2.6.26-rc8. However, I did find the results of dbench varied less between
runs with the "repork" patch that made next_zones_zonelist() an inline
function. Have you tried that patch yet?
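
Separately, to make Andy's point about the offset parameter concrete, here is
a small userspace model of the summation in nr_free_zone_pages(). It is only
a sketch and not kernel code: struct model_zone, model_free_pages() and the
zone index numbering are assumptions for illustration. It shows how limiting
the walk to zones at or below some index changes the total.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct zone; not the kernel definition. */
struct model_zone {
	int idx;                     /* in this model: 0 DMA, 1 NORMAL, 2 HIGHMEM */
	unsigned long present_pages; /* pages present in the zone */
	unsigned long pages_high;    /* high watermark */
};

/*
 * Sum the pages above the high watermark, visiting only zones whose
 * index is <= highest_idx. Passing the largest index visits every zone,
 * which models a walk that ignores the limit entirely.
 */
unsigned long model_free_pages(const struct model_zone *zones, size_t n,
			       int highest_idx)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (zones[i].idx > highest_idx)
			continue;
		if (zones[i].present_pages > zones[i].pages_high)
			sum += zones[i].present_pages - zones[i].pages_high;
	}
	return sum;
}
```

With invented zone sizes, a walk limited to index 1 leaves out whatever the
highest zone would have contributed, so a total computed with the wrong limit
would show up as a different "Total pages" figure in dmesg.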

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab