Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

From: Mel Gorman
Date: Mon Mar 02 2009 - 07:16:46 EST

Next message: Mike Rapoport: "Re: [PATCH] rtc-v3020: add ability to access v3020 chip with GPIOs"
Previous message: Ingo Molnar: "Re: [PATCH] xen: core dom0 support"
In reply to: Nick Piggin: "Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2"
Next in thread: Nick Piggin: "Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> On Mon, Mar 02, 2009 at 11:21:22AM +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> >
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > >
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > >
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > >
> >
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> >
> > The OLTP results had the following things to say about the page allocator.
>
> Is this OLTP, or UDP-U-4K?
>

OLTP. I didn't do a comparison for UDP due to uncertainty about what I was
looking at other than to note that high-order allocations may be a
bigger deal there.
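
For a rough sense of scale, the OLTP sample counts quoted in this thread
(6207 vs 4911 in the free path, 19948 vs 14238 in the allocation path) can
be turned into relative reductions. This is only arithmetic on the quoted
figures; the helper name `relative_reduction` is made up for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which the sample count dropped from `before` to `after`."""
    return (before - after) / before


# OLTP sample counts as quoted in this thread: (vanilla, mg-v2).
samples = {
    "free path": (6207, 4911),
    "allocation path": (19948, 14238),
}

for path, (vanilla, mg_v2) in samples.items():
    print(f"{path}: {relative_reduction(vanilla, mg_v2):.1%} fewer samples")
```

That works out to roughly a fifth fewer samples in the free path and a bit
over a quarter fewer in the allocation path, with the caveat that sample
counts only approximate time spent.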

>
> > Samples in the free path
> > vanilla: 6207
> > mg-v2: 4911
> > Samples in the allocation path
> > vanilla: 19948
> > mg-v2: 14238
> >
> > This is based on glancing at the following graphs and not counting the VM
> > counters as it can't be determined which samples are due to the allocator
> > and which are due to the rest of the VM accounting.
> >
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
> >
> > So the path costs are reduced in both cases. Whatever caused the regression
> > there doesn't appear to be in time spent in the allocator but due to
> > something else I haven't imagined yet. Other oddness
> >
> > o According to the profile, something like 45% of time is spent entering
> > the __alloc_pages_nodemask() function. Function entry costs but not
> > that much. Another significant part appears to be in checking a simple
> > mask. That doesn't make much sense to me so I don't know what to do with
> > that information yet.
> >
> > o In get_page_from_freelist(), 9% of the time is spent deleting a page
> > from the freelist.
> >
> > Neither of these make sense, we're not spending time where I would expect
> > to at all. One of two things are happening. Something like cache misses or
> > bounces are dominating for some reason that is specific to this machine. Cache
> > misses are one possibility that I'll check out. The other is that the sample
> > rate is too low and the profile counts are hence misleading.
> >
> > Question 1: Would it be possible to increase the sample rate and track cache
> > misses as well please?
>
> If the events are constantly biased, I don't think sample rate will
> help. I don't know how the internals of profiling counters work exactly,
> but you would expect yes cache misses, and stalls from any number of
> different resources could put results in funny places.
>

Ok, if it's stalls that are the real factor then yes, increasing the
sample rate might not help. However, the sample rates for instructions
were so low, I thought it might be a combination of both low sample
count and stalls happening at particular places. A profile of cache
misses will still be useful as it'll say in general if there is a marked
increase overall or not.
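
The distinction between a low sample count and a constant bias can be
sketched numerically. This assumes samples land independently (a
Poisson-style model, which is an assumption and not something measured
here), so the relative noise on a count of N is about 1/sqrt(N). The
function name is hypothetical.

```python
import math


def relative_sampling_error(sample_count):
    """Approximate relative standard error of a count of independent samples.

    Assumes Poisson-like sampling: standard deviation ~ sqrt(N), so the
    relative error is ~ 1 / sqrt(N). Systematic skew (e.g. samples being
    attributed to the instruction after a stall) is NOT captured here.
    """
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return 1.0 / math.sqrt(sample_count)


# Quadrupling the number of samples only halves the random noise.
for n in (100, 400, 1600, 6400):
    print(f"N={n:5d}  relative error ~ {relative_sampling_error(n):.1%}")
```

Under that model a higher sample rate shrinks only the random component; if
the samples are consistently attributed to the wrong place, more of them
will land in the same wrong place.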

> Intel's OLTP workload is very sensitive to cacheline footprint of the
> kernel, and if you touch some extra cachelines at point A, it can just
> result in profile hits getting distributed all over the place. Profiling
> cache misses might help, but probably see a similar phenomenon.
>

Interesting, this might put a hole in replacing the gfp_zone() with a
version that uses an additional (or maybe two depending on alignment)
cacheline.
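
The "one or two depending on alignment" point can be illustrated with a
small calculation. The 64-byte line size and the 64-byte table size below
are assumptions chosen for illustration only, not the actual size of any
gfp_zone() lookup table, and `cachelines_touched` is a hypothetical helper.

```python
def cachelines_touched(offset, size, line_size=64):
    """Number of cache lines covered by `size` bytes starting at `offset`.

    `line_size` defaults to 64 bytes, which is an assumption about the CPU.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    first_line = offset // line_size
    last_line = (offset + size - 1) // line_size
    return last_line - first_line + 1


# Hypothetical 64-byte table: aligned to a line boundary vs. straddling one.
print(cachelines_touched(offset=0, size=64))   # aligned
print(cachelines_touched(offset=32, size=64))  # misaligned by half a line
```

An aligned table of that size stays within one line, while the same table
starting mid-line spills into a second, which is the extra footprint a
cacheline-sensitive workload might notice.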

> I can't remember, does your latest patchset include any patches that change
> the possible order in which pages move around? Or is it just made up of
> straight-line performance improvement of existing implementation?
>

It shouldn't affect order. I did a test a while ago to make sure pages
were still coming back in contiguous order as some IO cards depend on this
behaviour for performance. The intention for the first pass is a straight-line
performance improvement.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
