Re: Kswapd in 3.2.0-rc5 is a CPU hog
From: Minchan Kim
Date: Mon Dec 26 2011 - 22:58:07 EST
On Tue, Dec 27, 2011 at 11:15:43AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Dec 2011 07:45:03 +1100
> Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
>
> > > Here is the report of trace-cmd while dd'ing
> > > https://80.237.6.56/report-dd.xz
> >
> > Ok, it's not a shrink_slab() problem - it's just being called every
> > ~100us by kswapd. The pattern is:
> >
> > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > of zone 1, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - try to isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > - try to isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> >
> > And it just repeats over and over again. After a while, nid=0,zone=1
> > drops out of the traces, so reclaim only comes in batches of 10-30
> > pages from zone 2 between each shrink_slab() call.
> >
> > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > finishes at 111216.1s with kswapd going to sleep on node 0 with
> > 930067 pages on the LRU. So ~7 seconds to free ~15,000 pages (call it
> > 2,000 pages/s), which is awfully slow...
> >
> > vmscan gurus - time for you to step in now...
> >
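(Checking Dave's numbers above: 944776 - 930067 = 14709 pages in about
111216.1 - 111209.881 = 6.2 seconds, i.e. ~2400 pages/s. Assuming 4 KiB
pages, that is only ~9 MB/s of reclaim - the same order as his
2,000 pages/s estimate.)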
>
> Can you show /proc/zoneinfo ? I want to know each zone's size.
>
> Below is my memo.
>
> In the trace log, priority = 11 or 12. So I think kswapd can reclaim enough
> memory to satisfy the "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition, and it
> loops again.
>
> Looking at balance_pgdat() and the trace log, I guess it does:
>
> wake up
>
> shrink_zone(zone=0(DMA?)) => nothing to reclaim.
> shrink_slab()
> shrink_zone(zone=1(DMA32?)) => reclaim 32,32,31 pages
> shrink_slab()
> shrink_zone(zone=2(NORMAL?)) => reclaim 13 pages.
> shrink_slab()
>
> sleep or retry.
>
> Why does shrink_slab() need to be called as frequently as this?
I guess it's caused by the small NORMAL zone.
The scenario I have in mind is as follows (the toy model after the list
makes it concrete):
1. dd consumes memory in the NORMAL zone
2. dd enters direct reclaim and wakes up kswapd
3. kswapd reclaims some memory in the NORMAL zone until it reaches the high watermark
4. schedule
5. dd consumes memory in the NORMAL zone again
6. kswapd fails to reach the high watermark because of 5.
7. loop again, goto 3.
The point is the speed of reclaim vs. the speed of memory consumption.
So kswapd never reaches a point where enough pages are free in the NORMAL zone.
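
To make the race concrete, here is a toy userspace model of steps 3-7
(all rates and watermark numbers are made up for illustration; this is
not kernel code). kswapd frees ~94 pages per pass, as in Dave's trace,
but dd consumes faster, so the free count never reaches the high
watermark and kswapd never sleeps:

/* Toy model of the kswapd-vs-dd race (illustrative numbers only). */
#include <stdio.h>

int main(void)
{
	long free_pages = 1000;           /* free pages in the small NORMAL zone */
	const long high_wmark = 2000;     /* kswapd's target (step 3) */
	const long reclaim_per_pass = 94; /* ~32+32+30, as in the trace */
	const long dd_consumes = 120;     /* assume dd allocates faster */

	for (int pass = 1; pass <= 10; pass++) {
		free_pages += reclaim_per_pass;  /* step 3: kswapd reclaims */
		free_pages -= dd_consumes;       /* step 5: dd consumes again */
		printf("pass %2d: free=%ld (need %ld to sleep)\n",
		       pass, free_pages, high_wmark);
		if (free_pages >= high_wmark) {  /* step 6 never succeeds */
			printf("kswapd sleeps\n");
			return 0;
		}
	}
	printf("kswapd is still looping (step 7)\n");
	return 0;
}

Each pass corresponds to one more balance_pgdat() round, i.e. one more
burst of shrink_slab() calls that find nothing to shrink.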
>
> BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim
> memory during the 'dd' operation? (no direct reclaim by dd.)
> Does this log record the CPU hog after 'dd'?
If the above scenario is right, dd couldn't enter direct reclaim to reclaim memory.
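
Roughly speaking, direct reclaim only triggers when an allocation falls
below the zone's min watermark; kswapd is woken at the low watermark and
works toward the high one. As long as kswapd keeps free pages above min,
dd's allocations keep succeeding without direct reclaim. A small sketch
of that decision (the watermark values are invented for illustration,
not the real zone->watermark[] values):

/* Illustrative watermark logic; the numbers are made up. */
#include <stdio.h>

enum { WMARK_MIN = 500, WMARK_LOW = 1000, WMARK_HIGH = 2000 };

static const char *alloc_path(long free_pages)
{
	if (free_pages > WMARK_LOW)
		return "fast path";
	if (free_pages > WMARK_MIN)
		return "wake kswapd (no direct reclaim)";
	return "direct reclaim";
}

int main(void)
{
	/* kswapd keeps free pages hovering between min and high,
	 * so dd keeps taking the middle branch. */
	long samples[] = { 1500, 900, 700, 950, 800 };
	for (int i = 0; i < 5; i++)
		printf("free=%ld -> %s\n", samples[i], alloc_path(samples[i]));
	return 0;
}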
>
> Thanks,
> -Kame
--
Kind regards,
Minchan Kim