Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

From: KAMEZAWA Hiroyuki
Date: Tue Jun 08 2010 - 22:56:47 EST

On Tue, 8 Jun 2010 10:02:19 +0100
Mel Gorman <mel@xxxxxxxxx> wrote:

> I finally got a chance last week to visit the topic of direct reclaim
> avoiding writing out pages. As it came up during discussions the last
> time, I also had a stab at making the VM write ranges of pages instead
> of individual pages. I am not proposing this for merging yet; first I want
> to see what people think of this general direction and whether we can agree
> that it is the right one.
>
> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_ops->writepage to write a page back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns. The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.
>
> The first 4 patches are a forward-port of trace points that are partly
> based on trace points defined by Larry Woodman but never merged. They trace
> parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
> tracepoints can be used to evaluate what is happening within reclaim and
> whether things are getting better or worse. They do not have to be part of
> the final series but might be useful during discussion.
>
> Patch 5 writes out contiguous ranges of pages where possible using
> a_ops->writepages. When writing a range, the inode is pinned and the page
> lock released before submitting to writepages(). This potentially generates
> a better IO pattern and it should avoid a lock inversion problem within the
> filesystem that wants the same page lock held by the VM. The downside with
> writing ranges is that the VM may be generating more IO than necessary.
>
> Patch 6 prevents direct reclaim writing out pages at all and instead dirty
> pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> wait on dirty pages to be written out before trying to reclaim the dirty
> pages a second time.
>
> The last patch increases the responsibility of kswapd somewhat because
> it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> a better fit than background flushers to clean pages as it knows where the
> pages needing cleaning are. As it's async IO, it should not cause kswapd to
> stall (at least until the queue is congested) but the order that pages are
> reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> by direct reclaimers are getting another lap on the LRU. The dirty pages
> could have been put on a dedicated list but this increased counter overhead
> and the number of lists and it is unclear if it is necessary.
>
> The series has survived performance and stress testing, particularly around
> high-order allocations on X86, X86-64 and PPC64. The results of the tests
> showed that while lumpy reclaim has a slightly lower success rate when
> allocating huge pages, the rates were still very acceptable, reclaim was
> a lot less disruptive and allocation latency was lower.
>
> Comments?

My concern is how memcg should work. IOW, what changes will memcg need in
order to work with the new vmscan logic, which does no direct writeback?

Maybe an ideal solution would be:
- support buffered I/O tracking in the I/O cgroup.
- flusher threads should work with the I/O cgroup.
- memcg itself should support a dirty ratio, and add a trigger to kick flusher
  threads for dirty pages in a memcg.
But I know that is a long way off.

How does the new logic work with memcg? Because memcg doesn't trigger kswapd,
does memcg have to wait for a flusher thread to make pages clean?
Or should memcg have a kswapd-for-memcg?

Is it okay to call writeback directly when !scanning_global_lru()?
memcg's reclaim routine is only called from specific positions, so I guess
there is no stack problem. But we would still have the I/O pattern problem.
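
To make the question concrete, the policy being asked about could be modelled
roughly as below. This is only a user-space sketch in Python, not kernel code:
ScanControl, may_write_dirty_page() and shrink_page_list() here are
hypothetical stand-ins invented for illustration, and the memcg exception is
the idea raised above, not something the posted series is known to implement.

```python
from dataclasses import dataclass


@dataclass
class ScanControl:
    """Hypothetical reclaim context (not the kernel's struct scan_control)."""
    is_kswapd: bool   # True when the caller is kswapd
    global_lru: bool  # True for global reclaim, False for memcg reclaim


def may_write_dirty_page(sc):
    """Decide whether this reclaim context may write a dirty page itself."""
    # kswapd has its own stack, so calling into the filesystem is allowed.
    if sc.is_kswapd:
        return True
    # Assumed exception: memcg reclaim is entered from few, known call sites,
    # so the stack-depth concern may not apply there.
    if not sc.global_lru:
        return True
    # Global direct reclaim: never write, leave the page for someone else.
    return False


def shrink_page_list(pages, sc):
    """Classify (page_id, is_dirty) pairs.

    Returns three lists of page ids: reclaimed, queued for writeback, and
    put back on the LRU for another lap.
    """
    reclaimed, written, put_back = [], [], []
    for page_id, is_dirty in pages:
        if not is_dirty:
            reclaimed.append(page_id)
        elif may_write_dirty_page(sc):
            written.append(page_id)
        else:
            put_back.append(page_id)
    return reclaimed, written, put_back
```

With this model, global direct reclaim puts every dirty page back on the LRU,
while memcg reclaim would still issue single-page writes, which is exactly
where the I/O pattern problem remains.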

