Re: [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure

From: Johannes Weiner
Date: Sat Oct 11 2014 - 19:28:41 EST


On Wed, Oct 08, 2014 at 01:47:25PM -0400, Johannes Weiner wrote:
> On Wed, Oct 08, 2014 at 05:33:29PM +0200, Michal Hocko wrote:
> > On Tue 07-10-14 21:11:06, Johannes Weiner wrote:
> > > On Tue, Oct 07, 2014 at 03:59:50PM +0200, Michal Hocko wrote:
> > > > Another part that matters is the size. Memcgs might be really small and
> > > > that changes the math. Large reclaim target will get to low prio reclaim
> > > > and thus the excessive reclaim.
> > >
> > > I already addressed page size vs. memcg size before.
> > >
> > > However, low priority reclaim does not result in excessive reclaim.
> > > The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX
> > > pages, and it exits if the goal has been met. See shrink_lruvec(),
> > > shrink_zone() etc.
> >
> > Now I am confused. shrink_zone will bail out but shrink_lruvec will loop
> > over nr[...] until they are empty and only updates the numbers to be
> > roughly proportional once:
> >
> > if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> > continue;
> >
> > /*
> > * For kswapd and memcg, reclaim at least the number of pages
> > * requested. Ensure that the anon and file LRUs are scanned
> > * proportionally what was requested by get_scan_count(). We
> > * stop reclaiming one LRU and reduce the amount scanning
> > * proportional to the original scan target.
> > */
> > [...]
> > scan_adjusted = true;
> >
> > Or do you rely on
> > /*
> > * It's just vindictive to attack the larger once the smaller
> > * has gone to zero. And given the way we stop scanning the
> > * smaller below, this makes sure that we only make one nudge
> > * towards proportionality once we've got nr_to_reclaim.
> > */
> > if (!nr_file || !nr_anon)
> > break;
> >
> > and SCAN_FILE because !inactive_file_is_low?
>
> That function is indeed quite confusing.
>
> Once nr_to_reclaim has been met, it looks at both LRUs and decides
> which one has the smaller scan target left, sets it to 0, and then
> scales the bigger target in proportion - that means if it scanned 10%
> of nr[file], it sets nr[anon] to 10% of its original size, minus the
> anon pages it already scanned. Based on that alone, overscanning is
> limited to twice the requested size, i.e. 4MB for a 2MB THP page,
> regardless of how low the priority drops.

Sorry, this conclusion is incorrect. The proportionality code can
indeed lead to more overreclaim than that, although I think this is
actually not intended: the comment says "this makes sure we only make
one nudge towards proportionality once we've got nr_to_reclaim," but
once scan_adjusted we never actually check anymore. We we can end up
making a lot more nudges toward proportionality.

However, the following still applies, so it shouldn't matter:

> In practice, the VM is heavily biased to avoid swapping. The forceful
> SCAN_FILE you point out is one condition that avoids the proportional
> scan most of the time. But even the proportional scan is heavily
> biased towards cache - every cache insertion decreases the file
> recent_rotated/recent_scanned ratio, whereas anon faults do not.
>
> On top of that, anon pages start out on the active list, whereas cache
> starts on the inactive, which means that the majority of the anon scan
> target - should there be one - usually translates to deactivation.
>
> So in most cases, I'd expect the scanner to bail after nr_to_reclaim
> cache pages, and in low cache situations it might scan up to 2MB worth
> of anon pages, a small amount of which it might swap.
>
> I don't particularly like the decisions the current code makes, but it
> should work. We have put in a lot of effort to prevent overreclaim in
> the last few years, and a big part of this was decoupling the priority
> level from the actual reclaim results. Nowadays, the priority level
> should merely dictate the scan window and have no impact on the number
> of pages actually reclaimed. I don't expect that it will, but if it
> does, that's a reclaim bug that needs to be addressed. If we ask for
> N pages, it should never reclaim significantly more than that,
> regardless of how aggressively it has to scan to accomplish that.

That being said, I don't get the rationale behind the proportionality
code in shrink_lruvec(). The patch that introduced it - e82e0561dae9
("mm: vmscan: obey proportional scanning requirements for kswapd") -
mentions respecting swappiness, but as per above we ignore swappiness
anyway until we run low on cache and into actual pressure. And under
pressure, once we struggle to reclaim nr_to_reclaim, proportionality
enforces itself when one LRU type target hits zero and we continue to
scan the one for which more pressure was allocated. But as long as we
scan both lists at the same SWAP_CLUSTER_MAX rate and have no problem
getting nr_to_reclaim pages with left-over todo for *both* LRU types,
what's the point of going on?

The justification for enforcing proportionality in direct reclaim is
particularly puzzling:

---

commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813
Author: Mel Gorman <mgorman@xxxxxxx>
Date: Wed Jun 4 16:10:49 2014 -0700

mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

[...]

3.15.0-rc5 3.15.0-rc5
shrinker proportion
Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The results are within
the noise for the small test machine. The impact of the patch is more noticable from the vmstats

3.15.0-rc5 3.15.0-rc5
shrinker proportion
Minor Faults 35154 36784
Major Faults 611 1305
Swap Ins 394 1651
Swap Outs 4394 5891
Allocation stalls 118616 44781
Direct pages scanned 4935171 4602313
Kswapd pages scanned 15921292 16258483
Kswapd pages reclaimed 15913301 16248305
Direct pages reclaimed 4933368 4601133
Kswapd efficiency 99% 99%
Kswapd velocity 670088.047 682555.961
Direct efficiency 99% 99%
Direct velocity 207709.217 193212.133
Percentage direct scans 23% 22%
Page writes by reclaim 4858.000 6232.000
Page writes file 464 341
Page writes anon 4394 5891

Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.

---

The timings show nothing useful, but the statistics strongly speak
*against* this patch. Sure, direct reclaim invocations are reduced,
but everything else worsened: minor faults increased, major faults
doubled(!), swapins quadrupled(!!), swapins increased, pages scanned
increased, pages reclaimed increased, reclaim page writes increased.

If direct reclaim is invoked at that rate, kswapd is failing at its
job, and the solution shouldn't be to overscan in direct reclaim. On
the other hand, multi-threaded sparse readers are kind of expected to
overwhelm a single kswapd worker, I'm not sure we should be tuning
allocation latency to such a workload in the first place.

Mel, do you maybe remember details that are not in the changelogs?
Because based on them alone, I think we should look at other ways to
ensure we scan with the right amount of vigor...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/