Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim

From: Mel Gorman
Date: Mon Jul 18 2011 - 17:13:35 EST

On Mon, Jul 18, 2011 at 12:20:11PM -0500, Christoph Lameter wrote:
> On Mon, 18 Jul 2011, Mel Gorman wrote:
> > On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> > > On Fri, 15 Jul 2011, Mel Gorman wrote:
> > >
> > > > Currently the zonelist cache is set up only after the first zone has
> > > > been considered and zone_reclaim() has been called. The objective was
> > > > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > > > it is failing regularly such as the first eligible zone having mostly
> > > > mapped pages, the cost in scanning and allocation stalls is far higher
> > > > than the ZLC initialisation step.
> > >
> > > Would it not be easier to set zlc_active and allowednodes based on the
> > > zone having an active ZLC at the start of get_pages()?
> > >
> >
> > What do you mean by a zone's active ZLC? zonelists are on a per-node,
> > not a per-zone basis (see node_zonelist) so a zone doesn't have an
> > active ZLC as such. If the zlc_active is set at the beginning of
>
> Look at get_page_from_freelist(): It sets
> zlc_active = 0 even though the zonelist under consideration may have a
> ZLC. zlc_active = 0 can also mean that the function has not bothered to
> look for the zlc information of the current zonelist.

Yes. So? It's only necessary if the watermarks are not met.

> > get_page_from_freelist(), it implies that we are calling zlc_setup()
> > even when the watermarks are met which is unnecessary.
>
> Ok then that decision to not call zlc_setup() for performance reasons is
> what created the problem that you are trying to solve. In case that the
> first zone's watermarks are okay we can avoid calling zlc_setup().

The original implementation did not check the ZLC in the first loop
at all. It wasn't just about avoiding the cost of setup. I suspect
this problem has been there a long time and it's taking this long
for bug reports to show up because NUMA machines are being used for
generic NUMA-unaware workloads.
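
To make the difference concrete, here is a toy userspace model of the
allocation loop. It is only a sketch: toy_zone, toy_stats and
toy_get_page() are hypothetical names, not kernel code. It assumes
reclaim always fails on a zone below its watermark and it ignores the
expiry of the ZLC "full" bits. The setup_before_reclaim flag selects
between setting up the ZLC after the first reclaim attempt (the
behaviour described as "currently") and setting it up as soon as the
first watermark check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types or signatures. */
struct toy_zone {
	bool watermark_ok;	/* would the watermark check pass? */
	bool marked_full;	/* toy ZLC bit: a recent reclaim failed here */
};

struct toy_stats {
	int zlc_setups;		/* times the ZLC was initialised */
	int reclaim_calls;	/* times the (always failing) reclaim ran */
};

/*
 * One allocation attempt over the zone array. Returns the index of the
 * zone that satisfied the request, or -1 if none did.
 */
int toy_get_page(struct toy_zone *zones, size_t nzones,
		 bool setup_before_reclaim, struct toy_stats *st)
{
	bool zlc_active = false;

	assert(zones != NULL && st != NULL);

	for (size_t i = 0; i < nzones; i++) {
		struct toy_zone *z = &zones[i];

		/* Once the ZLC is active, zones cached as full are skipped. */
		if (zlc_active && z->marked_full)
			continue;

		/* Common case: watermarks are met, no ZLC cost at all. */
		if (z->watermark_ok)
			return (int)i;

		if (setup_before_reclaim && !zlc_active) {
			st->zlc_setups++;
			zlc_active = true;
			/* The ZLC was just activated, so consult it now. */
			if (z->marked_full)
				continue;
		}

		/* Toy assumption: reclaim on this zone never succeeds. */
		st->reclaim_calls++;

		if (!zlc_active) {
			/* Late setup: only after the first zone was tried. */
			st->zlc_setups++;
			zlc_active = true;
		}
		z->marked_full = true;
	}
	return -1;
}
```

With a first zone that fails its watermark and a second zone that
passes, repeated calls with setup_before_reclaim == false run the toy
reclaim on every call, because the first zone is always examined before
the ZLC is active. With setup_before_reclaim == true the reclaim runs
once and later calls skip the first zone. When the first zone's
watermark is met, neither variant pays for the setup.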

> What we do now have is checking for zlc_active in the loop just so that
> the first time around we do not call zlc_setup().

Yes, why incur the cost for the common case?

> We may be able to simplify the function by:
> 1. Checking for the special case that the first zone is ok and that we do
> not want to call zlc_setup before we get to the loop.
> 2. Do the zlc_setup() before the loop.
> 3. Remove the zlc_setup() code as you did from the loop as well as the
> checks for zlc_active. zlc_active becomes not necessary since a zlc
> is always available when we go through the loop.

That initial test will involve duplication of things like the cpuset and
no watermarks check just to place the zlc_setup() in a different place.
I might be missing your point but it seems like the gain would be
marginal. Fancy posting a patch?
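
For reference, one reading of the three steps proposed above, again as a
toy userspace sketch and not a patch. toy_req, toy_pzone and
toy_get_page_precheck() are hypothetical names, and the allowed_mask
test merely stands in for the cpuset check. The eligibility tests are
deliberately written out twice to show the duplication in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; hypothetical names, not the kernel's types. */
struct toy_req {
	bool no_watermarks;	/* stand-in for a "no watermarks" request */
	unsigned allowed_mask;	/* stand-in for the cpuset constraint */
};

struct toy_pzone {
	unsigned node_bit;	/* bit identifying the zone's node */
	bool watermark_ok;	/* would the watermark check pass? */
};

/*
 * Returns the index of the zone that satisfied the request, or -1.
 * *zlc_setups counts how often the (toy) ZLC setup was paid for.
 */
int toy_get_page_precheck(const struct toy_req *rq,
			  const struct toy_pzone *zones, size_t nzones,
			  int *zlc_setups)
{
	assert(rq != NULL && zones != NULL && zlc_setups != NULL);
	assert(nzones > 0);

	/* Step 1: special-case the first zone before the loop. */
	if ((rq->allowed_mask & zones[0].node_bit) &&
	    (rq->no_watermarks || zones[0].watermark_ok))
		return 0;

	/* Step 2: set up the ZLC once, before the loop. */
	(*zlc_setups)++;

	/*
	 * Step 3: the loop no longer needs a zlc_active flag, but it
	 * repeats the same cpuset and watermark tests as step 1.
	 */
	for (size_t i = 0; i < nzones; i++) {
		if (!(rq->allowed_mask & zones[i].node_bit))
			continue;
		if (rq->no_watermarks || zones[i].watermark_ok)
			return (int)i;
	}
	return -1;
}
```

In this sketch the common case still avoids the setup, but only by
repeating the eligibility tests outside the loop. It also pays for the
setup whenever the very first zone is not allowed, even if the next
zone would have passed immediately.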

Mel Gorman