Re: [PATCH 0/4] Reclaim page capture v3

From: Nick Piggin
Date: Mon Sep 08 2008 - 23:32:18 EST

On Tuesday 09 September 2008 02:41, Andy Whitcroft wrote:
> On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote:

> > So... what does the non-simulation version (ie. the real app) say?
> In the common use model, use of huge pages is all or nothing. Either
> there are sufficient pages allocatable at application start time or there
> are not. As huge pages are not swappable once allocated they stay there.
> Applications either start using huge pages or they fallback to small pages
> and continue. This makes the real metric, how often does the customer get
> annoyed because their application has fallen back to small pages and is
> slow, or how often does their database fail to start. It is very hard to
> directly measure that and thus to get a comparative figure. Any attempt
> to replicate that seems as necessarily artificial as the current test.

But you have customers telling you they're getting annoyed because of
this? Or you have your own "realistic" workloads that allocate hugepages
on demand (OK, when I say realistic, something like specjbb or whatever
is obviously a reasonable macrobenchmark even if it isn't strictly

> > *Much* less likely, actually. Because there should be very little
> > allocation required for reclaim (only dirty pages, and only when backed
> > by filesystems that do silly things like not ensuring their own reserves
> > before allowing the page to be dirtied).
> Well yes and no. A lot of filesystems do do such stupid things.
> Allocating things like journal pages which have relatively long lives
> during reclaim. We have seen these getting placed into the memory we
> have just freed and preventing higher-order coalescing.

They shouldn't, because that's sad and deadlocky. But yes I agree it
happens sometimes.

> > Also, your scheme still doesn't avoid allocation for reclaim so I don't
> > see how you can use that as a point against queueing but not capturing!
> Obviously we cannot prevent allocations during reclaim. But we can avoid
> those allocations falling within the captured range. All pages under
> capture are marked. Any page returning to the allocator that merges with a
> buddy under capture, or that is a buddy under capture, is kept separately,
> such that any allocations from within reclaim will necessarily take
> pages from elsewhere in the pool. The key thing about capture is that it
> effectively marks ranges of pages out of use for allocations for the period
> of the reclaim, so we have a number of small ranges blocked out, not the
> whole pool. This allows parallel allocations (reclaim and otherwise)
> to succeed against the reserves (refilled by kswapd etc), whilst marking
> the pages under capture out and preventing them from being used.
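
A rough model of the check described in the quoted paragraph, to make the
discussion concrete. This is only a sketch, not code from the patch series:
struct capture_range, buddy_of(), in_capture() and divert_on_free() are
hypothetical names, and it assumes a single captured range aligned to its
own size. The buddy calculation (flip the bit for the block's order) is the
usual binary buddy one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one range under capture. */
struct capture_range {
	unsigned long start_pfn;	/* assumed aligned to 1 << order */
	unsigned int order;		/* range covers 1 << order pages */
};

/* Buddy of the block starting at pfn, as in a binary buddy allocator. */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

static bool in_capture(const struct capture_range *c, unsigned long pfn)
{
	return pfn >= c->start_pfn &&
	       pfn < c->start_pfn + (1UL << c->order);
}

/*
 * Should a block being freed be held back instead of going to the
 * general free lists? In this model: yes if it lies inside the captured
 * range, or if the buddy it would merge with does.
 */
static bool divert_on_free(const struct capture_range *c,
			   unsigned long pfn, unsigned int order)
{
	assert(order <= c->order);	/* model only covers blocks up to the range size */
	return in_capture(c, pfn) || in_capture(c, buddy_of(pfn, order));
}
```

With a captured order-10 range starting at pfn 1024, a page freed at pfn
1500 is held back, while one freed at pfn 2048 goes to the general lists
and stays available to allocations made from within reclaim.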

Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce
freed pages. And I'm not just talking about massive order-10 allocations
or something where you have the targeted reclaim which improves chances of
getting a free page within that range, but also for order-1/2/3 pages
that might be more commonly used in normal kernel workloads but can still
have a fairly high latency to succeed if there is a lot of other parallel
allocation happening.

> > I don't see why it should be unfair to allow a process to allocate 1024
> > order-0 pages ahead of one order-10 page (actually, yes the order 10 page
> > is I guess somewhat more valuable than the same number of fragmented
> > pages, but you see where I'm coming from).
> I think we have our wires crossed here. I was saying it would seem
> unfair to block the allocator from giving out order-0 pages while we are
> struggling to get an order-10 page for one process. Having a queue
> would seem to generate such behaviour. What I am trying to achieve with
> capture is to push areas likely to return us a page of the requested
> size out of use, while we try and reclaim it without blocking the rest
> of the allocator.

We don't have our wires crossed. I just don't agree that it is unfair.
It might be unfair to allow order-10 allocations at the same *rate* as
order-0 allocations, which is why you could allow some priority in the
queue. But when you decide you want to satisfy an order-10 allocation,
do you want the other guys potentially mopping up your coalescing
candidates? (and right, for targeted reclaim, maybe this is less of a
problem, but I'm also thinking about a general solution for all orders
of pages not just hugepages).

> > At least with queueing you are basing the idea on *some* reasonable
> > policy, rather than purely random "whoever happens to take this lock at
> > the right time will win" strategy, which might I add, can even be much
> > more unfair than you might say queueing is.
> >
> > However, queueing would still be able to allow some flexibility in
> > priority. For example:
> >
> > if (must_queue) {
> >         if (queue_head_prio == 0)
> >                 join_queue(1<<order);
> >         else {
> >                 queue_head_prio -= 1<<order;
> >                 skip_queue();
> >         }
> > }
> Sure you would be able to make some kind of more flexible decisions, but
> that still seems like a heavy handed approach. You are important enough
> to take pages (possibly from our mostly free high order page) or not.

I don't understand the thought process that leads you to these assertions.
Where do you get your idea of fairness or importance?

I would say that allowing 2^N order-0 allocations for every order-N
allocation if both allocators are in a tight loop (and reserves are low,
ie. reclaim is required) is a completely reasonable starting point for
fairness. Do you disagree with that? How is it less fair than your

> In a perfect world we would be able to know in advance that an order-N
> region would definitely come free if reclaimed and allocate that preemptively
> to the requestor, apply reclaim to it, and then actually allocate the page.

But with my queueing you get effectively the same thing without having an
oracle. Because you will wait for *any* order-N region to become free.

The "tradeoff" is blocking other allocators. But IMO that is actually a
good thing because that equates to a general fairness model in our allocator
for all allocation types in place of the existing, fairly random game of
chance (especially for order-2/3+) that we have now.
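
To make the priority idea concrete, here is a small standalone model of the
queue pseudocode quoted earlier. It is a sketch under assumptions, not
allocator code: struct alloc_queue, must_wait() and skips_before_blocking()
are made-up names, only the "must queue" (low reserves) case is modelled,
and clamping the credit at zero when a request costs more than what is left
is my own choice, since the pseudocode does not say what happens there.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue state for the low-reserves case. */
struct alloc_queue {
	unsigned long head_prio;	/* pages others may still take ahead of the head waiter */
	unsigned int waiters;		/* requests currently queued */
};

/* Returns true if the caller had to join the queue, false if it skipped it. */
static bool must_wait(struct alloc_queue *q, unsigned int order)
{
	unsigned long cost;

	assert(order < 8 * sizeof(unsigned long));
	cost = 1UL << order;

	if (q->head_prio == 0) {
		/* join_queue(1 << order): a new head brings its own credit */
		if (q->waiters == 0)
			q->head_prio = cost;
		q->waiters++;
		return true;
	}

	/* skip_queue(): charge the head waiter for the pages taken (clamped) */
	q->head_prio = q->head_prio > cost ? q->head_prio - cost : 0;
	return false;
}

/* How many order-0 requests get past a queued request of big_order? */
static unsigned long skips_before_blocking(unsigned int big_order)
{
	struct alloc_queue q = { 0, 0 };
	unsigned long skipped = 0;

	must_wait(&q, big_order);	/* the large request queues first */
	while (!must_wait(&q, 0))
		skipped++;
	return skipped;
}
```

In this model skips_before_blocking(10) comes out as 1024, i.e. 2^N order-0
allocations are let through for each queued order-N request before the
others have to wait behind it.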

Even for hugepages: with your capture patches, if process 0 comes in
and does all this reclaim work and has nearly freed up some linear region
of memory; then do you think it is reasonable if process 1 happens to come
in and get lucky and find a subsequently coalesced hugepage and allocate