Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

From: Mel Gorman
Date: Thu Sep 02 2010 - 14:28:31 EST


On Thu, Sep 02, 2010 at 05:49:54PM +0200, Johannes Weiner wrote:
> On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> > On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > > Removing congestion_wait() in do_try_to_free_pages() definitely
> > > worsens reclaim behaviour for this workload:
> > >
> > > 1. wallclock time of the testrun increases by 11%
> > >
> > > 2. the scanners do a worse job and go for the wrong zone:
> > >
> > > -pgalloc_dma 79597
> > > -pgalloc_dma32 134465902
> > > +pgalloc_dma 297089
> > > +pgalloc_dma32 134247237
> > >
> > > -pgsteal_dma 77501
> > > -pgsteal_dma32 133939446
> > > +pgsteal_dma 294998
> > > +pgsteal_dma32 133722312
> > >
> > > -pgscan_kswapd_dma 145897
> > > -pgscan_kswapd_dma32 266141381
> > > +pgscan_kswapd_dma 287981
> > > +pgscan_kswapd_dma32 186647637
> > >
> > > -pgscan_direct_dma 9666
> > > -pgscan_direct_dma32 1758655
> > > +pgscan_direct_dma 302495
> > > +pgscan_direct_dma32 80947179
> > >
> > > -pageoutrun 1768531
> > > -allocstall 614
> > > +pageoutrun 1927451
> > > +allocstall 8566
> > >
> > > I attached the full vmstat contents below. Also the test program,
> > > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> >
> > Excellent stuff. I didn't look at your vmstat output because it was for
> > an old patch and you have already highlighted the problems related to
> > the workload. Chances are, I'd just reach the same conclusions. What is
> > interesting is your workload.
>
> [...]
>
> > Ok, well there was some significant feedback on why wholesale changing of
> > congestion_wait() reached too far and I've incorporated that feedback. I
> > also integrated your workload into my testsuite (btw, because there is no
> > license the script has to download it from a google archive. I might get
> > back to you on licensing this so it can be made a permanent part of the suite).
>
> Oh, certainly, feel free to add the following file header:
>
> /*
> * Copyright (c) 2010 Johannes Weiner
> * Code released under the GNU GPLv2.
> */
>

Super, thanks. Downloading off a mail archive is a bit contorted :)

> > These are the results just for your workload on the only machine I had
> > available with a lot of disk. There are a bunch of kernels because I'm testing
> > a superset of different series posted recently. The nocongest column is an
> > unreleased patch that has congestion_wait() and wait_iff_congested() that
> > only goes to sleep if there is real congestion or a lot of writeback going
> > on. Rather than worrying about the patch contents for now, lets consider
> > the results for just your workload.
> >
> > The report is in 4 parts. The first is the vmstat counter differences as
> > a result of running your test. The exact interpretation of good and bad
> > here is open to interpretation. The second part is based on the vmscan
> > tracepoints. The third part is based on the congestion tracepoints and
> > the final part reports CPU usage and elapsed time.
> >
> > MICRO
> > traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> > pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
> > pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
> > pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
> > pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
> > pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
> > pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
> > pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
> > pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
> > pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
> > allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)
> >
> >
> > So, the allocstall counts go up of course because it is incremented
> > every time direct reclaim is entered and nocongest is only going to
> > sleep when there is congestion or significant writeback. I don't see
> > this as being nceessarily bad.
>
> Agreed. Also the dma zone is less allocated from, which I suppose is
> only the second zone in the zonelist, after dma32. So allocations
> succeed more often from the first-choice zone with your patches.
>

Which is good sortof.

> > Direct scanning rates go up a bit as you'd expect - again because we
> > are sleeping less. It's interesting that the pages reclaimed is
> > reduced implying that despite higher scanning rates, there is less
> > reclaim activity.
> >
> > It's debatable if this is good or not because higher scanning rates in
> > themselves are not bad but fewer pages reclaimed seems positive so lets
> > see what the rest of the reports look like.
> >
> > FTrace Reclaim Statistics: vmscan
> > micro-traceonly-v1r4-micromicro-nocongest-v1r4-micromicro-lowlumpy-v1r4-micromicro-nodirect-v1r4-micro
> > traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> > Direct reclaims 2722 70156 67554 87691
> > Direct reclaim pages scanned 23955333 40161426 40034132 45080524
> > Direct reclaim write file async I/O 0 0 0 0
> > Direct reclaim write anon async I/O 0 0 0 0
> > Direct reclaim write file sync I/O 0 0 0 0
> > Direct reclaim write anon sync I/O 0 0 0 0
> > Wake kswapd requests 2718040 17801688 17622777 18379572
> > Kswapd wakeups 24 1 1 1
> > Kswapd pages scanned 177738381 161597313 161729224 156704078
> > Kswapd reclaim write file async I/O 0 0 0 0
> > Kswapd reclaim write anon async I/O 0 0 0 0
> > Kswapd reclaim write file sync I/O 0 0 0 0
> > Kswapd reclaim write anon sync I/O 0 0 0 0
> > Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
> > Time kswapd awake (seconds) 489.17 400.20 403.19 390.08
> >
> > Total pages scanned 201693714 201758739 201763356 201784602
> > %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> > %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> > Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
> > Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%
> >
> > Interesting, kswapd is now staying awake (woke up only once) even though
> > the total time awake was reduced and it looks like because it was requested
> > to wake up a lot more that was keeping it awake. Despite the higher scan
> > rates from direct reclaim, the time actually spent direct reclaiming is
> > significantly reduced.
> >
> > Scanning rates and times we direct reclaim go up but as we finish work a
> > lot faster, it would seem that we are doing less work overall.
>
> I do not reach the same conclusion here. More pages are scanned
> overall on the same workload, so we _are_ doing more work.
>

We did more scanning but we finished sooner so the processor would sleep
sooner, consume less power etc. How good or bad this is depends on what
we mean by work but I am of the opinion that finishing sooner is better
overall. That said...

> The result for this single-threaded workload improves because CPU-time
> is not the issue when the only runnable process needs memory.
>
> But we are in fact becoming less efficient at reclaim, so it would
> make sense to also test how this interacts with other processes that
> do need the CPU concurrently.
>

This is indeed a problem. Measuring multiple workloads and drawing
sensible conclusions is going to be a real pain. I'd at least try
running multiple instances of your workload first and seeing how it
interacts.

> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest waited 3664 0 0 0
> > Direct time congest waited 247636ms 0ms 0ms 0ms
> > Direct full congest waited 3081 0 0 0
> > Direct number conditional waited 0 47587 45659 58779
> > Direct time conditional waited 0ms 0ms 0ms 0ms
> > Direct full conditional waited 3081 0 0 0
> > KSwapd number congest waited 1448 949 909 981
> > KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
> > KSwapd full congest waited 1056 90 115 147
> > KSwapd number conditional waited 0 0 0 0
> > KSwapd time conditional waited 0ms 0ms 0ms 0ms
> > KSwapd full conditional waited 1056 90 115 147
> >
> > congest waited is congestion_wait() and conditional waited is
> > wait_iff_congested(). Look at what happens to the congest waited times
> > for direct reclaim - it disappeared and despite the number of times
> > wait_iff_congested() was called, it never actually decided it needed to
> > sleep. kswapd is still congestion waiting but the time it spent is
> > reduced.
> >
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
> > Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53
> >
> > This is plain old time. The same test completes 91 seconds faster.
> > Ordinarily at this point I would be preparing to do a full series report
> > including the other benchmarks but I'm interested in seeing if there is
> > a significantly different reading of the above report as to whether it
> > is a "good" or "bad" result?
>
> I think one interesting piece that is missing is whether the
> scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
> value still available to calculate that ratio?
>

Knowing the scanning/reclaimed ration would be nice. The tracepoints as-is
are orientiated around scanning/write ratio and assumed discarding clean
pages was not that big a deal. I did capture vmstat before and after the
test which is not as fine-grained as the tracepoints but it'll do

Vanilla scan: 203050870
Vanilla steal: 102099263
Nocongest scan: 203117496
Nocongest stal: 102114301

Vanilla ratio: 0.5028
Nocongest ratio: 0.5027

So here is an interesting point. While direct scanning rates went up,
overall scanning rates were approximately the same. All that changed
really was who was doing the scanning which is a timing issue. The
reclaim to scanning ratio were *very* close together for the vanilla and
nocongest kernels but nocongest completed far faster.

> A "good" result would be, IMO, if that ratio did not get worse, while
> at the same time having reclaim perform better due to reduced sleeps.
>

So, the ratio did not get worse. The scanning rates were the same although
who was scanning changed and I'm undecided as to whether that is significant
or not. Overall, the test completed faster which is good.

> Another aspect to look out for is increased overreclaim: the total
> number of allocations went up (I suppose the sum of reclaimed pages as
> well), which means reclaim became more eager and created more
> throwout-refault churn. Those were refaults from a sparse-file, but a
> slow backing dev will have more impact on wall clock time.
>

I can capture the reclaim rates handy enough - just needs another trace
point to keep everything in step. Refaults would be trickier so I would
have a tendency to rely on wall time to tell me if the wrong reclaim
decisions were being made.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/