Re: kswapd craziness in 3.7

From: Thorsten Leemhuis
Date: Fri Nov 30 2012 - 07:37:56 EST


Johannes Weiner wrote on 29.11.2012 18:05:
> On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
>> Mel Gorman wrote on 29.11.2012 00:54:
>> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> >> On Wed, 28 Nov 2012 10:13:59 +0000
>> >> Mel Gorman <mgorman@xxxxxxx> wrote:
>> >> > Based on the reports I've seen I expect the following to work for 3.7
>> >> > Keep
>> >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>> >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>> >> > Revert
>> >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> >> > Merge
>> >> > mm: vmscan: fix kswapd endless loop on higher order allocation
>> >> > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> >> myself" and when Zdenek tested it he hit an unexplained oom.
>> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
>> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
>> > swap configured and inactive/active file pages were very low. Finally,
>> > the free pages for Normal looked off and could also have been affected by
>> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
>> > here. Are you thinking of something else?
>> > Admittedly, I have not tested with the patch myself, but Thorsten has
>> > and seemed to be ok with it: https://lkml.org/lkml/2012/11/23/276.
>> Yeah, on my two main workhorses a few different kernels based on rc6
>> or rc7 worked fine with this patch. But sorry, it seems the patch
>> doesn't fix the problems that Fedora user John Ellson sees; he tried
>> kernels I built in the Fedora build system. Details:
> [...]
>> I know, this makes things more complicated again; but I wanted to let
>> you guys know that some problem might still be lurking somewhere. Side
>> note: right now it seems John can trigger the problem more quickly (or
>> perhaps only?) on i686 than on x86-64 with kernels that contain
>> "Avoid-waking-kswapd-for-THP-allocations-when".
>
> Humm, highmem... Could this be the lowmem protection forcing kswapd
> to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
> time it's woken up?
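
A minimal sketch of the lowmem protection being referred to, assuming the
3.7-era __zone_watermark_ok() logic: watermark_ok_sketch() below is a
made-up name for illustration, while zone->lowmem_reserve[] is the real
field, and it is what appears as the "protection:" lines in the
/proc/zoneinfo dump further down.

#include <linux/mmzone.h>	/* struct zone, watermark helpers */

/*
 * Simplified sketch, not the real __zone_watermark_ok(): free pages at or
 * below mark + lowmem_reserve[classzone_idx] do not count as usable for an
 * allocation that could also have been placed in a higher zone (the zone
 * indexed by classzone_idx), so those pages stay reserved for lowmem-only
 * users.
 */
static bool watermark_ok_sketch(struct zone *z, unsigned long mark,
				int classzone_idx, unsigned long free_pages)
{
	return free_pages > mark + z->lowmem_reserve[classzone_idx];
}
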
>
> This requires somebody to wake up kswapd regularly, though, and from
> his report it's not quite clear to me whether kswapd gets stuck or just
> has really high CPU usage while the system is still under load. The
> initial post says he would expect "<5% cpu when idling" but his top
> snippet in there shows there are other tasks running as well. So does
> it happen while the system is busy or when it's otherwise idle?
>
> [ On the other hand, not waking kswapd from THP allocations seems not
> to show this problem on his i686 machine. But it could also just be a
> tiny window of conditions aligning perfectly that drops kswapd into an
> endless loop, and the increased wakeups increase the probability of
> hitting it. So, yeah, this would be good to know. ]
>
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
>
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
>
> Preferably with this patch applied, to rule out faulty lowmem
> protection:
>
> buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> when figuring out whether the zone is balanced and so priority levels
> are not descended and no progress is ever made.
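
To make the quoted explanation concrete, here is a minimal, simplified
sketch of the mismatch being described. zone_wants_reclaim() and
zone_counts_as_balanced() are illustration-only names, not the actual
balance_pgdat() code, but buffer_heads_over_limit, is_highmem(),
high_wmark_pages() and zone_watermark_ok_safe() are the real symbols
involved.

#include <linux/mmzone.h>	/* zone_watermark_ok_safe(), high_wmark_pages() */
#include <linux/buffer_head.h>	/* buffer_heads_over_limit */

/* buffer_heads_over_limit can make kswapd target a highmem zone ... */
static bool zone_wants_reclaim(struct zone *zone, int order, int classzone_idx)
{
	if (buffer_heads_over_limit && is_highmem(zone))
		return true;

	return !zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
				       classzone_idx, 0);
}

/*
 * ... but the balance test only looks at the watermark, so such a zone may
 * never be considered balanced and kswapd keeps looping without descending
 * to a more aggressive priority.
 */
static bool zone_counts_as_balanced(struct zone *zone, int order,
				    int classzone_idx)
{
	return zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
				      classzone_idx, 0);
}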

/me wonders how to elegantly get out of his man-in-the-middle position

John was able to reproduce the problem quickly with a kernel that
contained the patch from your mail. For details see

https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later

He provided the information there; here are parts of it:

/proc/vmstat while kswapd0 was at 100% CPU:

nr_free_pages 196858
nr_inactive_anon 15804
nr_active_anon 65
nr_inactive_file 20792
nr_active_file 11307
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14385
nr_mapped 2393
nr_file_pages 32563
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 3113
nr_slab_unreclaimable 4725
nr_page_table_pages 271
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1487
nr_vmscan_immediate_reclaim 3
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 381
nr_dirtied 388323
nr_written 361128
nr_anon_transparent_hugepages 1
nr_free_cma 0
nr_dirty_threshold 38188
nr_dirty_background_threshold 19094
pgpgin 1057223
pgpgout 1552306
pswpin 8
pswpout 1487
pgalloc_dma 5548
pgalloc_normal 10651864
pgalloc_high 2191246
pgalloc_movable 0
pgfree 13055503
pgactivate 440358
pgdeactivate 259724
pgfault 31423675
pgmajfault 3760
pgrefill_dma 2174
pgrefill_normal 212914
pgrefill_high 51755
pgrefill_movable 0
pgsteal_kswapd_dma 1
pgsteal_kswapd_normal 202106
pgsteal_kswapd_high 36515
pgsteal_kswapd_movable 0
pgsteal_direct_dma 18
pgsteal_direct_normal 0
pgsteal_direct_high 3818
pgsteal_direct_movable 0
pgscan_kswapd_dma 1
pgscan_kswapd_normal 203044
pgscan_kswapd_high 40407
pgscan_kswapd_movable 0
pgscan_direct_dma 18
pgscan_direct_normal 0
pgscan_direct_high 4409
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 264192
kswapd_inodesteal 171676
kswapd_low_wmark_hit_quickly 0
kswapd_high_wmark_hit_quickly 26
kswapd_skip_congestion_wait 0
pageoutrun 117729182
allocstall 5
pgrotated 1628
compact_blocks_moved 313
compact_pages_moved 7192
compact_pagemigrate_failed 265
compact_stall 13
compact_fail 9
compact_success 4
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2985
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1877
unevictable_pgs_mlocked 3965
unevictable_pgs_munlocked 3965
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 636
thp_fault_fallback 10
thp_collapse_alloc 342
thp_collapse_alloc_failed 2
thp_split 6


/proc/zoneinfo while kswapd0 was at 100% CPU:

Node 0, zone DMA
pages free 1655
min 196
low 245
high 294
scanned 0
spanned 4080
present 3951
nr_free_pages 1655
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 3
nr_slab_unreclaimable 1
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 315
nr_written 315
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 861, 1000, 1000)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 2
all_unreclaimable: 1
start_pfn: 16
inactive_ratio: 1
Node 0, zone Normal
pages free 186234
min 10953
low 13691
high 16429
scanned 0
spanned 222206
present 220470
nr_free_pages 186234
nr_inactive_anon 3147
nr_active_anon 2
nr_inactive_file 14064
nr_active_file 4672
nr_unevictable 0
nr_mlock 0
nr_anon_pages 3028
nr_mapped 216
nr_file_pages 18857
nr_dirty 8
nr_writeback 0
nr_slab_reclaimable 3110
nr_slab_unreclaimable 4729
nr_page_table_pages 62
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 311
nr_vmscan_immediate_reclaim 2
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 114
nr_dirtied 339809
nr_written 315061
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 0, 1111, 1111)
pagesets
cpu: 0
count: 81
high: 186
batch: 31
vm stats threshold: 8
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 1
Node 0, zone HighMem
pages free 8983
min 34
low 475
high 917
scanned 0
spanned 35840
present 35560
nr_free_pages 8983
nr_inactive_anon 12661
nr_active_anon 64
nr_inactive_file 6849
nr_active_file 6500
nr_unevictable 0
nr_mlock 0
nr_anon_pages 11357
nr_mapped 2177
nr_file_pages 13692
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 209
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1176
nr_vmscan_immediate_reclaim 1
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 267
nr_dirtied 48189
nr_written 45739
nr_anon_transparent_hugepages 1
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 20
high: 42
batch: 7
vm stats threshold: 4
all_unreclaimable: 0
start_pfn: 226302
inactive_ratio: 1


First part of the perf report captured while kswapd0 was at 100% CPU:

# ========
# captured on: Fri Nov 30 07:22:00 2012
# hostname : rawhide
# os release : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# perf version : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# arch : i686
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : QEMU Virtual CPU version 1.0.1
# cpuid : AuthenticAMD,6,2,3
# total memory : 1027716 kB
# cmdline : /usr/bin/perf record -g -a sleep 5
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 7 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5
# ========
#
# Samples: 20K of event 'cpu-clock'
# Event count (approx.): 20016
#
# Overhead Command Shared Object Symbol
# ........ ........... ......................... ...................................
#
16.52% kswapd0 [kernel.kallsyms] [k] idr_get_next
|
--- idr_get_next
|
|--99.76%-- css_get_next
| mem_cgroup_iter
| |
| |--50.49%-- shrink_zone
| | kswapd
| | kthread
| | ret_from_kernel_thread
| |
| --49.51%-- kswapd
| kthread
| ret_from_kernel_thread
--0.24%-- [...]

11.23% kswapd0 [kernel.kallsyms] [k] prune_super
|
--- prune_super
|
|--86.74%-- shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--13.26%-- kswapd
kthread
ret_from_kernel_thread

10.73% kswapd0 [kernel.kallsyms] [k] shrink_slab
|
--- shrink_slab
|
|--99.58%-- kswapd
| kthread
| ret_from_kernel_thread
--0.42%-- [...]

7.36% kswapd0 [kernel.kallsyms] [k] grab_super_passive
|
--- grab_super_passive
|
|--92.46%-- prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--7.54%-- shrink_slab
kswapd
kthread
ret_from_kernel_thread

5.82% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--34.28%-- put_super
| drop_super
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--30.50%-- grab_super_passive
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--17.27%-- prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--16.15%-- drop_super
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--1.20%-- mb_cache_shrink_fn
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.60%-- shrink_slab
kswapd
kthread
ret_from_kernel_thread

4.43% kswapd0 [kernel.kallsyms] [k] fill_contig_page_info
|
--- fill_contig_page_info
|
|--99.10%-- fragmentation_index
| compaction_suitable
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.90%-- compaction_suitable
kswapd
kthread
ret_from_kernel_thread

3.81% kswapd0 [kernel.kallsyms] [k] shrink_lruvec
|
--- shrink_lruvec
|
|--99.34%-- shrink_zone
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.66%-- kswapd
kthread
ret_from_kernel_thread

The rest is at https://bugzilla.redhat.com/attachment.cgi?id=654977

CU
Thorsten