Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

From: Christoph Hellwig
Date: Tue Jun 15 2010 - 10:44:19 EST

On Tue, Jun 15, 2010 at 04:22:19PM +0200, Andrea Arcangeli wrote:
> If we were forbidden to call ->writepage just because of stack
> overflow yes as I don't think it's big deal with memory compaction and
> I see this as a too limiting design to allow ->writepage only in
> kernel thread. ->writepage is also called by the pagecache layer,
> msync etc.. not just by kswapd.

Other callers of ->writepage are fine because they come from a
controlled environment with relatively little stack usage. The problem
with direct reclaim is that we splice multiple stack hogs on top of each
other.

Direct reclaim can come from any point that does memory allocations,
including those that absolutely have to because their stack "quota"
is almost used up. Let's look at a worst case scenario:

We're in a deep stack codepath, say

(1) core_sys_select, which has to kmalloc the array if it doesn't
fit on the huge stack variable. All fine by now, it stays in its
stack quota.
(2) That code now calls into the slab allocator, which doesn't find free
space in the large slab, and then calls into kmem_getpages, adding
more stack usage.
(3) That calls into alloc_pages_exact_node which adds stack usage of
the page allocator.
(4) no free pages in the zone anymore, and direct reclaim is invoked,
adding the stack usage of the reclaim code, which currently is
quite heavy.
(5) direct reclaim calls into foofs ->writepage. foofs_writepage
notices the page is delayed allocated and needs to convert it.
It now has to start a transaction, then call the extent management
code to convert the extent, which calls into the space management
code, which calls into the buffercache for the metadata buffers,
which needs to submit a bio to read/write the metadata.
(6) The metadata buffer goes through submit_bio and the block layer
code. Because we're doing a synchronous writeout it gets directly
dispatched to the block layer.
(7) for extra fun add a few remapping layers for raid or similar to
add to the stack usage.
(8) The lowlevel block driver is iscsi or something similar, so after
going through the scsi layer adding more stack it now goes through
the networking layer with tcp and ipv4 (if you're unlucky ipv6)
(9) we finally end up in the lowlevel networking driver (except that we
would have long overflown the stack)

And for extra fun:

(10) Just when we're way down that stack an IRQ comes in on the CPU that
we're executing on. Because we don't enable irqstacks for the only
sensible stack configuration (yeah, still bitter about the patch
for that getting ignored) it goes on the same stack above.
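
To make the arithmetic of the scenario above concrete, here is a minimal
sketch in Python that adds up per-layer stack usage along the call chain
and reports where a fixed-size kernel stack would be exceeded. The byte
counts and the 8192-byte stack size are made-up placeholders for
illustration, not measurements from any kernel; the layer names simply
follow steps (1) to (9).

```python
# Illustrative model only: every byte count below is a hypothetical
# placeholder, NOT a measurement of any real kernel.
STACK_SIZE = 8192  # assumed stack size in bytes (hypothetical)

# (layer, assumed stack bytes) in call order, following steps (1)-(9).
CALL_CHAIN = [
    ("core_sys_select", 1000),
    ("slab allocator / kmem_getpages", 400),
    ("page allocator", 500),
    ("direct reclaim", 1200),
    ("foofs ->writepage + transaction + extent code", 2500),
    ("submit_bio / block layer", 800),
    ("raid remapping layers", 900),
    ("scsi + iscsi + tcp/ip", 1800),
    ("lowlevel network driver", 600),
]


def first_overflow(chain, stack_size):
    """Return (layer, depth) for the first layer whose frame pushes the
    cumulative depth past stack_size, or None if the chain fits."""
    depth = 0
    for layer, usage in chain:
        depth += usage
        if depth > stack_size:
            return layer, depth
    return None


def total_usage(chain):
    """Sum the stack usage of every layer in the chain."""
    return sum(usage for _, usage in chain)


if __name__ == "__main__":
    print("total:", total_usage(CALL_CHAIN), "of", STACK_SIZE)
    print("first overflow:", first_overflow(CALL_CHAIN, STACK_SIZE))
```

With these placeholder numbers the chain fits as long as it stops at
direct reclaim (first_overflow(CALL_CHAIN[:4], STACK_SIZE) returns None)
and only overflows once ->writepage and everything below it is stacked
on top. An interrupt as in (10) would add yet another entry to the same
list.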

And note that the above does not only happen with ext4/btrfs/xfs that
have delayed allocations. With every other filesystem it can also
happen, just a lot less likely - when writing to a file through shared
mmaps we still have to call the allocator from ->writepage in

And seriously, if the VM isn't stopped from calling ->writepage from
reclaim context we FS people will simply ignore any ->writepage from
reclaim context. Been there, done that and never again.
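
As a rough illustration of what "ignoring ->writepage from reclaim
context" could look like, here is a toy Python model, not kernel code.
foofs_writepage is the made-up filesystem from the scenario above, the
Page class and the refuse_kswapd switch are hypothetical, and the flag
values are placeholders that only mimic the idea of the kernel's
PF_MEMALLOC and PF_KSWAPD task flags.

```python
# Toy model: flag values and names are illustrative placeholders that
# mimic the kernel's PF_MEMALLOC / PF_KSWAPD task flags; they are not
# copied from any kernel header.
PF_MEMALLOC = 0x1  # caller is running in memory reclaim context
PF_KSWAPD = 0x2    # caller is the kswapd kernel thread


class Page:
    """Hypothetical stand-in for a dirty pagecache page."""

    def __init__(self):
        self.dirty = True


def foofs_writepage(page, task_flags, refuse_kswapd=False):
    """Hypothetical ->writepage that skips writeback in reclaim context.

    Returns True if the page was written, False if it was left dirty.
    """
    in_reclaim = bool(task_flags & PF_MEMALLOC)
    is_kswapd = bool(task_flags & PF_KSWAPD)
    if in_reclaim and (refuse_kswapd or not is_kswapd):
        # Leave the page dirty so regular writeback picks it up later.
        page.dirty = True
        return False
    page.dirty = False  # stand-in for the real allocation and I/O
    return True
```

In this model a call with task_flags == PF_MEMALLOC (direct reclaim)
always leaves the page dirty, while a kswapd caller is only refused when
refuse_kswapd is set.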

Just wondering, what filesystems do your hugepage testing systems use?
If it's any of the ext4/btrfs/xfs above you're already seeing the
filesystem refuse ->writepage from both kswapd and direct reclaim,
so Mel's series will allow us to reclaim pages from more contexts
than before.