Re: Free memory never fully used, swapping

From: Shaohua Li
Date: Thu Nov 25 2010 - 20:06:06 EST


On Fri, 2010-11-26 at 00:12 +0800, Mel Gorman wrote:
> On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:
> > > > <SNIP>
> > > >
> > > > This x86_64 box has 4 GB of RAM; zones are set up as follows:
> > > >
> > > > [ 0.000000] Zone PFN ranges:
> > > > [ 0.000000] DMA 0x00000001 -> 0x00001000
> > > > [ 0.000000] DMA32 0x00001000 -> 0x00100000
> > > > [ 0.000000] Normal 0x00100000 -> 0x00130000
> > > > ...
> > > > [ 0.000000] On node 0 totalpages: 1047279
> > > > [ 0.000000] DMA zone: 56 pages used for memmap
> > > > [ 0.000000] DMA zone: 0 pages reserved
> > > > [ 0.000000] DMA zone: 3943 pages, LIFO batch:0
> > > > [ 0.000000] DMA32 zone: 14280 pages used for memmap
> > > > [ 0.000000] DMA32 zone: 832392 pages, LIFO batch:31
> > > > [ 0.000000] Normal zone: 2688 pages used for memmap
> > > > [ 0.000000] Normal zone: 193920 pages, LIFO batch:31
> > > >
> > > > So, "Normal" is relatively small, and DMA32 contains most of the RAM.
>
> Ok. A consequence of this is that kswapd balancing a node will still try
> to balance Normal even if DMA32 has enough memory. This could account
> for some of kswapd being mean.
>
> > > > Watermarks from /proc/zoneinfo are:
> > > >
> > > > Node 0, zone DMA
> > > > min 7
> > > > low 8
> > > > high 10
> > > > protection: (0, 3251, 4009, 4009)
> > > > Node 0, zone DMA32
> > > > min 1640
> > > > low 2050
> > > > high 2460
> > > > protection: (0, 0, 757, 757)
> > > > Node 0, zone Normal
> > > > min 382
> > > > low 477
> > > > high 573
> > > > protection: (0, 0, 0, 0)
> > > >
> > > > This box has a couple bnx2 NICs, which do about 60 Mbps each. Jumbo
> > > > frames were disabled for now (to try to stop big order allocations), but
> > > > this did not stop atomic allocations of order 3 coming in, as found with:
> > > >
> > > > perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph -c 1 -a sleep 10
> > > > perf report
> > > >
> > > > __alloc_pages_nodemask
> > > > alloc_pages_current
> > > > new_slab
> > > > __slab_alloc
> > > > __kmalloc_node_track_caller
> > > > __alloc_skb
> > > > __netdev_alloc_skb
> > > > bnx2_poll_work
> > > >
> > > > From my reading of this, it seems like __alloc_skb uses kmalloc(), and
> > > > kmalloc uses the kmalloc slab unless (unlikely(size > SLUB_MAX_SIZE)),
> > > > where SLUB_MAX_SIZE is 2 * PAGE_SIZE, in which case kmalloc_large is
> > > > called which allocates pages directly. This means that reception of
> > > > jumbo frames probably actually results in (consistent) smaller order
> > > > allocations! Anyway, these GFP_ATOMIC allocations don't seem to be
> > > > failing, BUT...
> > > >
>
> It's possible to reduce the maximum order that SLUB uses but let's not
> resort to that as a workaround just yet. In case it needs to be
> eliminated as a source of problems later, the relevant kernel parameter
> is slub_max_order=.
>
> > > > Right after kswapd goes to sleep, we're left with DMA32 with 421k or so
> > > > free pages, and Normal with 20k or so free pages (about 1.8 GB free).
> > > >
> > > > Immediately, zone Normal starts being used until it reaches about 468
> > > > pages free in order 0, nothing else free. kswapd is not woken here,
> > > > but allocations just start coming from zone DMA32 instead.
>
> kswapd is not woken up because we stay in the allocator fastpath once
> that much memory has been freed.
>
> > > > While this
> > > > happens, the occasional order=3 allocations coming in via the slab from
> > > > __alloc_skb seem to be picking away at the available order=3 chunks.
> > > > /proc/buddyinfo shows that there are 10k or so when it starts, so this
> > > > succeeds easily.
> > > >
> > > > After a minute or so, available order-3 start reaching a lower number,
> > > > like 20 or so. order-4 then starts dropping as it is split into order-3,
> > > > until it reaches 20 or so as well. Then, order-3 hits 0, and kswapd is
> > > > woken.
>
> Allocator slowpath.
>
> > > > When this occurs, there are still a few order-5, order-6, etc.,
> > > > available.
>
> Watermarks are probably not met though.
>
> > > > I presume the GFP_ATOMIC allocation can still split buddies
> > > > here, still making order-3 available without sleeping, because there is
> > > > no allocation failure message that I can see.
> > > >
>
> Technically it could, but watermark maintenance is important.
>
> > > > Here is a "while true; do sleep 1; grep -v 'DMA ' /proc/buddyinfo; done"
> > > > ("DMA" zone is totally untouched, always, so excluded; white space
> > > > crushed to avoid wrapping), while it happens:
> > > >
> > > > Node 0, zone DMA 2 1 1 2 1 1 1 0 1 1 3
> > > > Node 0, zone DMA32 25770 29441 14512 10426 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > ...
> > > > Node 0, zone DMA32 23343 29405 6062 6478 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 23187 29358 6047 5960 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 23000 29372 6047 5411 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22714 29391 6076 4225 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22354 29459 6059 3178 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22202 29388 6035 2395 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21971 29411 6036 1032 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21514 29388 6019 433 1796 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21334 29387 6019 240 1464 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21237 29421 6052 216 1336 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20968 29378 6020 244 751 123 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20741 29383 6022 134 272 123 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20476 29370 6024 117 48 116 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20343 29369 6020 110 23 10 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21592 30477 4856 22 10 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 24388 33261 1985 6 10 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 25358 34080 1068 0 4 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 75985 68954 5345 87 1 4 2 0 0 0 0
> > > > Node 0, zone Normal 18249 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81117 71630 19261 429 3 4 2 0 0 0 0
> > > > Node 0, zone Normal 17908 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81226 71299 21038 569 19 4 2 0 0 0 0
> > > > Node 0, zone Normal 18559 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81347 71278 21068 640 19 4 2 0 0 0 0
> > > > Node 0, zone Normal 17928 21 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81370 71237 21241 1073 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 18187 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81401 71237 21314 1139 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 16978 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81410 71239 21314 1145 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 18156 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81419 71232 21317 1160 30 4 2 0 0 0 0
> > > > Node 0, zone Normal 17536 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81347 71144 21443 1160 31 4 2 0 0 0 0
> > > > Node 0, zone Normal 18483 7 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81300 71059 21556 1178 38 4 2 0 0 0 0
> > > > Node 0, zone Normal 18528 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81315 71042 21577 1180 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18431 2 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81301 71002 21702 1202 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18487 5 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81301 70998 21702 1202 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18311 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81296 71025 21711 1208 45 4 2 0 0 0 0
> > > > Node 0, zone Normal 17092 5 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81299 71023 21716 1226 45 4 2 0 0 0 0
> > > > Node 0, zone Normal 18225 12 0 0 0 0 0 0 0 0 0
> > > >
> > > > Running a perf record on the kswapd wakeup right when it happens shows:
> > > > perf record --event vmscan:mm_vmscan_wakeup_kswapd -a --call-graph -c 1 -a sleep 10
> > > > perf trace
> > > > swapper-0 [002] 1323136.979119: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > swapper-0 [002] 1323136.979131: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > lmtp-20593 [003] 1323136.984066: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > lmtp-20593 [003] 1323136.984079: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > swapper-0 [001] 1323136.985511: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > swapper-0 [001] 1323136.985515: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > lmtp-20593 [003] 1323136.985673: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > lmtp-20593 [003] 1323136.985675: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > >
> > > > This causes kswapd to throw out a bunch of stuff from Normal and from
> > > > DMA32, to try to get zone_watermark_ok() to be happy for order=3.
>
> Yep.
>
> > > > However, we have a heavy read load from all of the email stored on SSDs
> > > > on this box, and kswapd ends up fighting to try to keep reclaiming the
> > > > allocations (mostly order-0). During the whole day, it never wins -- the
> > > > allocations are faster. At night, it wins after a minute or two. The
> > > > fighting is happening in all of the lines after it awakes above.
> > > >
>
> It's probably fighting to keep *all* zones happy even though it's not strictly
> necessary. I suspect it's fighting the most for Normal.
>
> > > > slabs_scanned, kswapd_steal, kswapd_inodesteal (slowly),
> > > > kswapd_skip_congestion_wait, and pageoutrun go up in vmstat while kswapd
> > > > is running. With the box up for 15 days, you can see it struggling on
> > > > pgscan_kswapd_normal (from /proc/vmstat):
> > > >
> > > > pgfree 3329793080
> > > > pgactivate 643476431
> > > > pgdeactivate 155182710
> > > > pgfault 2649106647
> > > > pgmajfault 58157157
> > > > pgrefill_dma 0
> > > > pgrefill_dma32 19688032
> > > > pgrefill_normal 7600864
> > > > pgrefill_movable 0
> > > > pgsteal_dma 0
> > > > pgsteal_dma32 465191578
> > > > pgsteal_normal 651178518
> > > > pgsteal_movable 0
> > > > pgscan_kswapd_dma 0
> > > > pgscan_kswapd_dma32 768300403
> > > > pgscan_kswapd_normal 34614572907
> > > > pgscan_kswapd_movable 0
> > > > pgscan_direct_dma 0
> > > > pgscan_direct_dma32 2853983
> > > > pgscan_direct_normal 885799
> > > > pgscan_direct_movable 0
> > > > pginodesteal 191895
> > > > pgrotated 27290463
> > > >
> > > > So, here are my questions.
> > > >
> > > > Why do we care about order > 0 watermarks at all in the Normal zone?
> > > > Wouldn't it make a lot more sense to just make the DMA32 zone the only
> > > > one we care about for larger-order allocations? Or is this required for
> > > > the hugepage stuff?
> > > >
>
> It's not required. The logic for kswapd is "balance all zones" and
> Normal is one of the zones. Even though you know that DMA32 is just
> fine, kswapd doesn't.
>
> > > > The fact that so much stuff is evicted just because order-3 hits 0 is
> > > > crazy, especially when larger order pages are still free. It seems like
> > > > we're trying to keep large orders free here. Why?
>
> Watermarks. The steady stream of order-3 allocations is telling the
> allocator and kswapd that these size pages must be available. It doesn't
> know that slub can happily fall back to smaller pages because that
> information is lost. Even removing __GFP_WAIT won't help because kswapd
> still gets woken up for atomic allocation requests.
>
> > > > Maybe things would be
> > > > better if kswapd does not reclaim at all unless the requested order is
> > > > empty _and_ all orders above are empty. This would require hugepage
> > > > users to use CONFIG_COMPACT, and have _compaction_ occur the way the
> > > > watermark checks work now, but people without CONFIG_HUGETLB_PAGE could
> > > > just actually use the memory. Would this work?
> > > >
> > > > There is logic at the end of balance_pgdat() to give up balancing order>0
> > > > and just try another loop with order = 0 if sc.nr_reclaimed is <
> > > > SWAP_CLUSTER_MAX. However, when this order=0 pass returns, the caller of
> > > > balance_pgdat(), kswapd(), gets true from sleeping_prematurely() and just
> > > > calls right back to balance_pgdat() again. I think this is why this
> > > > logic doesn't seem to work here.
> > > >
>
> Ok, this is true. kswapd in balance_pgdat() has given up on the order
> but that information is lost when sleeping_prematurely() is called so it
> constantly loops. That is a mistake. balance_pgdat() could return the order
> so sleeping_prematurely() doesn't do the wrong thing.
>
> > > > Is my assumption correct that a GFP_ATOMIC order=3 allocation works
> > > > even when order 3 is empty but order>3 is not? Regardless, shouldn't
> > > > kswapd be woken before order 3 is 0, since it may have nothing above
> > > > order 3 to split from, thus actually causing an allocation failure?
> > > > Does something else do this?
> > >
> > > Even if kswapd is woken after order>3 is empty, the issue will still
> > > occur, since the order > 3 pages will be used up soon and kswapd still
> > > needs to reclaim some pages. So the issue is that there are high-order
> > > page allocations and lumpy reclaim wrongly reclaims some pages. Maybe
> > > you should use slab instead of slub to avoid high-order allocations.
> >
> > There are actually a few problems here. I think they are worth looking
> > at separately, unless "don't use order 3 allocations" is a valid
> > statement, in which case we should fix slub.
> >
>
> SLUB can be forced to use smaller orders but I don't think that's the
> right fix here.
>
> > The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> > with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> > lower order allocation automatically if it fails. This is why there is
> > no allocation failure warning when this happens. However, it is too late
> > -- kswapd is woken and it tries to bring order 3 up to the watermark.
> > If we hacked __alloc_pages_slowpath() to not wake kswapd when
> > __GFP_NOWARN is set, we would never see this problem and the slub
> > optimization might still mostly work.
>
> Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
> failures as a result so that fix would cause other regressions.
>
> > Either way, we should "fix" slub
> > or "fix" order-3 allocations, so that other people who are using slub
> > don't hit the same problem.
> >
> > kswapd is throwing out many times what is needed for the order 3
> > watermark to be met. It seems to be not as bad now, but look at these
> > pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> > followed by nr_pages_free calculation and final order-3 watermark test,
> > kswapd woken after the second sample):
> >
> > Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
> >
> > DMA32 20374 35116 975 1 2 5 1 0 0 0 0 94770 257 <= 256
> > DMA32 20480 35211 870 1 1 5 1 0 0 0 0 94630 241 <= 256
> > (kswapd wakes, gobble gobble)
> > DMA32 24387 37009 2910 297 100 5 1 0 0 0 0 114245 4193 <= 256
> > DMA32 36169 37787 4676 637 110 5 1 0 0 0 0 137527 7073 <= 256
> > DMA32 63443 40620 5716 982 144 5 1 0 0 0 0 177931 10377 <= 256
> > DMA32 65866 57006 6462 1180 158 5 1 0 0 0 0 217918 12185 <= 256
> > DMA32 67188 66779 9328 1893 208 5 1 0 0 0 0 256754 18689 <= 256
> > DMA32 67909 67356 18307 2268 235 5 1 0 0 0 0 297977 22121 <= 256
> > DMA32 68333 67419 20786 4192 298 7 1 0 0 0 0 324907 38585 <= 256
> > DMA32 69872 68096 21580 5141 326 7 1 0 0 0 0 339016 46625 <= 256
> > DMA32 69959 67970 22339 5657 371 10 1 0 0 0 0 346831 51569 <= 256
> > DMA32 70017 67946 22363 6078 417 11 1 0 0 0 0 351073 55705 <= 256
> > DMA32 70023 67949 22376 6204 439 12 1 0 0 0 0 352529 57097 <= 256
> > DMA32 70045 67937 22380 6262 451 12 1 0 0 0 0 353199 57753 <= 256
> > DMA32 70062 67939 22378 6298 456 12 1 0 0 0 0 353580 58121 <= 256
> > DMA32 70079 67959 22388 6370 458 12 1 0 0 0 0 354285 58729 <= 256
> > DMA32 70079 67959 22388 6387 460 12 1 0 0 0 0 354453 58897 <= 256
> > DMA32 70076 67954 22387 6393 460 12 1 0 0 0 0 354484 58945 <= 256
> > DMA32 70105 67975 22385 6466 468 12 1 0 0 0 0 355259 59657 <= 256
> > DMA32 70110 67972 22387 6466 470 12 1 0 0 0 0 355298 59689 <= 256
> > DMA32 70152 67989 22393 6476 470 12 1 0 0 0 0 355478 59769 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70192 67990 22401 6495 471 12 1 0 0 0 0 355720 59937 <= 256
> > DMA32 70192 67988 22401 6496 471 12 1 0 0 0 0 355724 59945 <= 256
> > DMA32 70099 68061 22467 6602 477 12 1 0 0 0 0 356985 60889 <= 256
> > DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> > DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> > DMA32 70099 68062 22467 6603 477 12 1 0 0 0 0 356995 60897 <= 256
> > (kswapd sleeps)
> >
> > Normal zone at the same time (shown separately for clarity):
> >
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > (kswapd wakes)
> > Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> > Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> > Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> > Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> > Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> > Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> > Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> > Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> > Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> > Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> > Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> > Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> > Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> > Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> > Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> > Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > (kswapd sleeps -- maybe too much turkey)
> >
> > DMA32 gets so much reclaimed that the watermark test succeeded long ago.
> > Meanwhile, Normal is being reclaimed as well, but because it's fighting
> > with allocations, it tries for a while and eventually succeeds (I think),
> > but the 200ms samples didn't catch it.
> >
>
> So, the key here is kswapd didn't need to balance all zones, any one of
> them would have been fine.
>
> > KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> > to be similar to this problem, but your change is not working here.
>
> It's not because sleeping_prematurely() interferes with it.
>
> > We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> > is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> > increasing all the time. This means that balance_pgdat() keeps being
> > called, but sleeping_prematurely() is returning true, so kswapd() just
> > keeps re-calling balance_pgdat(). If your approach is correct to stop
> > kswapd here, the problem seems to be that balance_pgdat's copy of order
> > and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> > kswapd never really sleeps. How is this supposed to work?
> >
>
> It doesn't.
>
> > Our allocation load here is mostly file pages, some anon pages, and
> > relatively little slab and anything else.
> >
>
> I think there are at least two fixes required here.
>
> 1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
> the order.
> 2. kswapd is trying to balance all zones for higher orders even though
> it doesn't really have to.
>
> This patch has potential fixes for both of these problems. I have a split-out
> series but I'm posting it as a single patch to see if it allows kswapd to
> go to sleep as expected for you and whether it stops hammering the Normal
> zone unnecessarily. I tested it locally here (albeit with compaction
> enabled) and it did reduce the amount of time kswapd spent awake.
>
> ==== CUT HERE ====
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..25fe08d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -645,6 +645,7 @@ typedef struct pglist_data {
> wait_queue_head_t kswapd_wait;
> struct task_struct *kswapd;
> int kswapd_max_order;
> + enum zone_type high_zoneidx;
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> @@ -660,7 +661,7 @@ typedef struct pglist_data {
>
> extern struct mutex zonelists_mutex;
> void build_all_zonelists(void *data);
> -void wakeup_kswapd(struct zone *zone, int order);
> +void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
> int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> int classzone_idx, int alloc_flags);
> enum memmap_context {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07a6544..344b597 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> struct zone *zone;
>
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + wakeup_kswapd(zone, order, high_zoneidx);
> }
>
> static inline int
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..00529a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> #endif
>
> /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> {
> int i;
> + bool all_zones_ok = true;
> + bool any_zone_ok = false;
>
> /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> if (remaining)
> return 1;
>
> - /* If after HZ/10, a zone is below the high mark, it's premature */
> + /* Check the watermark levels */
> for (i = 0; i < pgdat->nr_zones; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> @@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>
> if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> 0, 0))
> - return 1;
> + all_zones_ok = false;
> + else
> + any_zone_ok = true;
> }
>
> - return 0;
> + /*
> + * For high-order requests, any zone meeting the watermark is enough
> + * to allow kswapd go back to sleep
> + * For order-0, all zones must be balanced
> + */
> + if (order)
> + return !any_zone_ok;
> + else
> + return !all_zones_ok;
> }
>
> /*
> @@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> {
> int all_zones_ok;
> + int any_zone_ok;
> int priority;
> int i;
> unsigned long total_scanned;
> @@ -2201,6 +2214,7 @@ loop_again:
> disable_swap_token();
>
> all_zones_ok = 1;
> + any_zone_ok = 0;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2310,10 +2324,12 @@ loop_again:
> * spectulatively avoid congestion waits
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + if (i <= pgdat->high_zoneidx)
> + any_zone_ok = 1;
> }
>
> }
> - if (all_zones_ok)
> + if (all_zones_ok || (order && any_zone_ok))
> break; /* kswapd: all done */
> /*
> * OK, kswapd is getting into trouble. Take a nap, then take
> @@ -2336,7 +2352,7 @@ loop_again:
> break;
> }
> out:
> - if (!all_zones_ok) {
> + if (!(all_zones_ok || (order && any_zone_ok))) {
> cond_resched();
>
> try_to_freeze();
> @@ -2361,7 +2377,13 @@ out:
> goto loop_again;
> }
>
> - return sc.nr_reclaimed;
> + /*
> + * Return the order we were reclaiming at so sleeping_prematurely()
> + * makes a decision on the order we were last reclaiming at. However,
> + * if another caller entered the allocator slow path while kswapd
> + * was awake, order will remain at the higher level
> + */
> + return order;
> }
This seems likely to always fail, because you have the protection on the
kswapd side but not on the page allocation side. So every time a
high-order allocation occurs, the protection breaks and kswapd keeps running.
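
To make the per-zone check concrete, here is a minimal sketch of the
order-aware loop that Simon's "or3-low-chk" column appears to follow. It is
modelled on zone_watermark_ok() but is not the kernel function: struct
zone_sample and both function names are made up for illustration, and
lowmem_reserve and alloc_flags are ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 11 /* orders 0..10, one per /proc/buddyinfo column */

/* Hypothetical stand-in for the free_area counters of a struct zone. */
struct zone_sample {
	long nr_free[NR_ORDERS]; /* free blocks at each order */
};

/* Total free pages in the zone: sum of nr_free[o] << o. */
static long total_free_pages(const struct zone_sample *z)
{
	long pages = 0;

	for (int o = 0; o < NR_ORDERS; o++)
		pages += z->nr_free[o] << o;
	return pages;
}

/*
 * Simplified watermark test (assumption: no lowmem_reserve, no alloc_flags).
 * On return, *left holds the value used in the last comparison, i.e. the
 * left-hand side of the "or3-low-chk" column.
 */
static bool watermark_ok(const struct zone_sample *z, int order, long mark,
			 long *left)
{
	assert(order >= 0 && order < NR_ORDERS);

	long free_pages = total_free_pages(z) - (1L << order) + 1;
	long min = mark;

	*left = free_pages;
	if (free_pages <= min)
		return false;

	for (int o = 0; o < order; o++) {
		/* Blocks of this order are too small for the request. */
		free_pages -= z->nr_free[o] << o;
		/* Require half as many free pages at each higher order. */
		min >>= 1;
		*left = free_pages;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

Fed with the last samples before kswapd sleeps, order 3 and the low
watermarks from /proc/zoneinfo (2050 and 477), this gives 60897 against 256
for DMA32 (passes) and -7 against 238 for Normal (fails), the same numbers as
in the tables. If the wakeup path still tests each zone on its own, Normal
fails there even while DMA32 passes, which is the mismatch I mean.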
