Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim

From: Dave Chinner
Date: Sun Jul 17 2011 - 22:22:43 EST


On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 00:46:43 -0400
> > Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > +			/*
> > > > > +			 * Only kswapd can writeback filesystem pages to
> > > > > +			 * avoid risk of stack overflow
> > > > > +			 */
> > > > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > +				goto keep_locked;
> > > > > +			}
> > > > > +
> > > >
> > > >
> > > > This will cause tons of memcg OOM kills because we get no help from kswapd (for now).
> > >
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > >
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > >
> >
> > Hmm, do XFS/btrfs add pages to the radix-tree from a deep stack?
>
> Here's an example writeback stack trace. Notice how deep it is from
> the __writepage() call?
....
>
> So from ->writepage, there is about 3.5k of stack usage here. 2.5k
> of that is in XFS, and the worst I've seen is around 4k before
> getting to the IO subsystem, which in the worst case I've seen
> consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> down to IO take over 6k of stack space on x86_64....
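
FWIW, the filesystem-side short-circuit Christoph mentions above boils
down to a check like the one below in ->writepage. This is a sketch of
the idea from memory rather than the exact XFS code: reclaim callers run
with PF_MEMALLOC set and only kswapd also has PF_KSWAPD, so writeback
from direct or memcg reclaim just redirties the page and bails out.

	/*
	 * Sketch of a ->writepage reclaim guard (not the literal XFS
	 * code): refuse to issue writeback when called from direct or
	 * memcg reclaim, i.e. PF_MEMALLOC is set but PF_KSWAPD is not,
	 * because those callers can arrive with most of the stack
	 * already consumed.  Redirty the page so that kswapd or the
	 * flusher threads pick it up later.
	 */
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}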

BTW, here's a stack trace that shows swap IO being issued from direct reclaim:

dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (46 entries)
        -----    ----   --------
  0)     5080      40   zone_statistics+0xad/0xc0
  1)     5040     272   get_page_from_freelist+0x2ad/0x7e0
  2)     4768     288   __alloc_pages_nodemask+0x133/0x7b0
  3)     4480      48   kmem_getpages+0x62/0x160
  4)     4432     112   cache_grow+0x2d1/0x300
  5)     4320      80   cache_alloc_refill+0x219/0x260
  6)     4240      64   kmem_cache_alloc+0x182/0x190
  7)     4176      16   mempool_alloc_slab+0x15/0x20
  8)     4160     144   mempool_alloc+0x63/0x140
  9)     4016      16   scsi_sg_alloc+0x4c/0x60
 10)     4000     112   __sg_alloc_table+0x66/0x140
 11)     3888      32   scsi_init_sgtable+0x33/0x90
 12)     3856      48   scsi_init_io+0x31/0xc0
 13)     3808      32   scsi_setup_fs_cmnd+0x79/0xe0
 14)     3776     112   sd_prep_fn+0x150/0xa90
 15)     3664      64   blk_peek_request+0xc7/0x230
 16)     3600      96   scsi_request_fn+0x68/0x500
 17)     3504      16   __blk_run_queue+0x1b/0x20
 18)     3488      96   __make_request+0x2cb/0x310
 19)     3392     192   generic_make_request+0x26d/0x500
 20)     3200      96   submit_bio+0x64/0xe0
 21)     3104      48   swap_writepage+0x83/0xd0
 22)     3056     112   pageout+0x122/0x2f0
 23)     2944     192   shrink_page_list+0x458/0x5f0
 24)     2752     192   shrink_inactive_list+0x1ec/0x410
 25)     2560     224   shrink_zone+0x468/0x500
 26)     2336     144   do_try_to_free_pages+0x2b7/0x3f0
 27)     2192     176   try_to_free_pages+0xa4/0x120
 28)     2016     288   __alloc_pages_nodemask+0x43f/0x7b0
 29)     1728      48   kmem_getpages+0x62/0x160
 30)     1680     128   fallback_alloc+0x192/0x240
 31)     1552      96   ____cache_alloc_node+0x9a/0x170
 32)     1456      16   __kmalloc+0x17d/0x200
 33)     1440     128   kmem_alloc+0x77/0xf0
 34)     1312     128   xfs_log_commit_cil+0x95/0x3d0
 35)     1184      96   _xfs_trans_commit+0x1e9/0x2a0
 36)     1088     208   xfs_create+0x57a/0x640
 37)      880      96   xfs_vn_mknod+0xa1/0x1b0
 38)      784      16   xfs_vn_create+0x10/0x20
 39)      768      64   vfs_create+0xb1/0xe0
 40)      704      96   do_last+0x5f5/0x770
 41)      608     144   path_openat+0xd5/0x400
 42)      464     224   do_filp_open+0x49/0xa0
 43)      240      96   do_sys_open+0x107/0x1e0
 44)      144      16   sys_open+0x20/0x30
 45)      128     128   system_call_fastpath+0x16/0x1b


That's pretty damn bad. From kmem_alloc to the top of the stack is
more than 3.5k through the direct reclaim swap IO path. That, to me,
kind of indicates that even doing swap IO on dirty anonymous pages
from direct reclaim risks overflowing the 8k stack on x86_64....
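
For the record, the 3.5k number comes straight from the Depth column
above (assuming it counts the bytes of stack consumed from that frame
down to the deepest function, which is how I read the stack tracer
output):

	5080 - 1440 = 3640 bytes from kmem_alloc() down to zone_statistics()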

Umm, hold on a second, WTF is my standard
create-lots-of-zero-length-inodes-in-parallel workload doing swapping?
Oh, shit, it's also running about 50% slower (50-60k files/s instead
of 110-120k files/s)....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx