Re: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

From: Minchan Kim
Date: Fri Nov 13 2015 - 01:19:44 EST


On Thu, Nov 12, 2015 at 11:44:53AM -0800, Shaohua Li wrote:
> On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> > The hotness of MADV_FREEed pages is very arguable:
> > some think they are hot while others consider them cold.
> >
> > Quote from Shaohua
> > "
> > My main concern is the policy for how we should treat the FREE pages. Moving
> > them to the inactive lru is definitely a good start; I'm wondering if it's
> > enough. MADV_FREE increases memory pressure and causes unnecessary reclaim
> > because of the lazy memory freeing. While MADV_FREE is intended to be a better
> > replacement for MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure
> > issue as it frees memory immediately. So I hope MADV_FREE doesn't have an
> > impact on memory pressure either. I'm thinking of adding an extra lru list and
> > watermark for this to make sure FREE pages can be freed before system-wide
> > page reclaim. As you said, this is arguable, but I hope we can discuss this
> > issue more.
> > "
> >
> > Quote from me
> > "
> > It seems the divergence comes from viewing MADV_FREE as a *replacement* for
> > MADV_DONTNEED. But I don't think of it that way. If we could discard
> > MADV_FREEed pages *anytime*, I would agree, but that's not true because the
> > pages may be in a dirty state by the time the VM wants to reclaim them.
> >
> > I'm also against your suggestion to discard FREEed pages before system-wide
> > page reclaim, because the system may have lots of clean, cold page cache or
> > anonymous pages. In such cases, reclaiming those would be better. Yes, it's
> > really workload-dependent, so we might need some heuristic, which is normally
> > what we want to avoid.
> >
> > Having said that, I agree with you that we could do better than deactivation,
> > and frankly speaking, I'm thinking of another LRU list (tentatively named the
> > "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez) fairly.
> > IOW, I want to percolate ez-LRU list reclaiming into get_scan_count. When
> > MADV_FREE is called, we could move hinted pages from the anon-LRU to the
> > ez-LRU, and then, if the VM finds it cannot discard a page on the ez-LRU, it
> > could promote it to the active-anon-LRU. That would be a very natural aging
> > concept, because it means someone touched the page recently.
> > With that, I don't want to bias either side, and I don't want to add a knob
> > for tuning the heuristic; let's rely on the VM's common fair aging scheme.
> > "
> >
> > Quote from Johannes
> > "
> > thread 1:
> > Even if we're wrong about the aging of those MADV_FREE pages, their
> > contents are invalidated; they can be discarded freely, and restoring
> > them is a mere GFP_ZERO allocation. All other anonymous pages have to
> > be written to disk, and potentially be read back.
> >
> > [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> > page cache. It's the same cost to discard both types of pages, but
> > restoring page cache involves IO. ]
> >
> > It probably makes sense to stop thinking about them as anonymous pages
> > entirely at this point when it comes to aging. They're really not. The
> > LRU lists are split to differentiate access patterns and cost of page
> > stealing (and restoring). From that angle, MADV_FREE pages really have
> > nothing in common with in-use anonymous pages, and so they shouldn't
> > be on the same LRU list.
> >
> > thread 2:
> > What about them is hot? They contain garbage, you have to write to
> > them before you can use them. Granted, you might have to refetch
> > cachelines if you don't do cacheline-aligned populating writes, but
> > you can do a lot of them before it's more expensive than doing IO.
> >
> > "
> >
> > Quote from Daniel
> > "
> > thread 1:
> > Keep in mind that this is memory the kernel wouldn't be getting back at
> > all if the allocator wasn't going out of the way to purge it, and they
> > aren't going to go out of their way to purge it if it means the kernel
> > is going to steal the pages when there isn't actually memory pressure.
> >
> > An allocator would be using MADV_DONTNEED if it didn't expect that the
> > pages were going to be used again shortly. MADV_FREE indicates that it
> > has time to inform the kernel that they're unused, but they could still
> > be very hot.
> >
> > thread 2:
> > It's hot because applications churn through memory via the allocator.
> >
> > Drop the pages and the application is now churning through page faults
> > and zeroing rather than simply reusing memory. It's not something that
> > may happen, it *will* happen. A page in the page cache *may* be reused,
> > but often won't be, especially when the I/O patterns don't line up well
> > with the way it works.
> >
> > The whole point of the feature is not requiring the allocator to have
> > elaborate mechanisms for aging pages and throttling purging. That ends
> > up resulting in lots of memory held by userspace where the kernel can't
> > reclaim it under memory pressure. If it's dropped before page cache, it
> > isn't going to be able to replace any of that logic in allocators.
> >
> > The page cache is speculative. Page caching by allocators is not really
> > speculative. Using MADV_FREE on the pages at all is speculative. The
> > memory is probably going to be reused fairly soon (unless the process
> > exits, and then it doesn't matter), but purging will end up reducing
> > memory usage for the portions that aren't.
> >
> > It would be a different story for a full unpinning/pinning feature since
> > that would have other use cases (speculative caches), but this is really
> > only useful in allocators.
> > "
> > You can read the whole thread at https://lkml.org/lkml/2015/11/4/51
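> >
> > For context, a malloc implementation would use the hint roughly like the
> > sketch below (illustrative userspace code only; it assumes the MADV_FREE
> > definition from this series is visible via <sys/mman.h>, and the helper
> > name is hypothetical):
> >
> > #include <sys/mman.h>
> >
> > /*
> >  * Sketch of an allocator's free path: mark a page-aligned span as
> >  * lazily freeable. The kernel may discard the pages under memory
> >  * pressure; a later write to a page cancels the hint for that page,
> >  * so the allocator can simply reuse the span.
> >  */
> > static void span_release(void *addr, size_t len)
> > {
> >         madvise(addr, len, MADV_FREE);
> > }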
> >
> > Yeah, since the issue is arguable and there is no single right answer,
> > I think we should provide a knob, "lazyfreeness" (I hope someone
> > suggests a better name).
> >
> > It's similar to swappiness: higher values discard MADV_FREE pages more
> > aggressively. When memory pressure happens and the system still reclaims at
> > DEF_PRIORITY (e.g., there are clean cold caches), the VM doesn't discard any
> > hinted pages until the scanning priority is increased.
> >
> > If memory pressure is higher (i.e., the priority is not DEF_PRIORITY),
> > it scans
> >
> > nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness (default: 20) / 50
> >
> > If the system is low on free memory and file cache, it starts to discard
> > MADV_FREEed pages unconditionally, even if the user set lazyfreeness to 0.
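> >
> > In plain C, the intended per-round scan target looks roughly like this
> > (a condensed sketch of the get_scan_count() change below, including the
> > low-memory fallback; the helper name and the folded-in watermark flag
> > are illustrative):
> >
> > static unsigned long lzfree_scan_target(struct scan_control *sc,
> >                                         int lzfreeness,
> >                                         unsigned long nr_lzfree,
> >                                         bool zone_low_on_file_and_free)
> > {
> >         /* zero at DEF_PRIORITY: no discarding under light pressure */
> >         unsigned long scan = sc->nr_to_reclaim *
> >                                 (DEF_PRIORITY - sc->priority);
> >
> >         scan = scan * lzfreeness / 50;
> >
> >         /* low free memory + file cache: discard even if scan was 0 */
> >         if (!scan && zone_low_on_file_and_free)
> >                 scan = nr_lzfree >> sc->priority;
> >
> >         return min(scan, nr_lzfree);
> > }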
> >
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > Documentation/sysctl/vm.txt | 13 +++++++++
> > drivers/base/node.c | 4 +--
> > fs/proc/meminfo.c | 4 +--
> > include/linux/memcontrol.h | 1 +
> > include/linux/mmzone.h | 9 +++---
> > include/linux/swap.h | 15 ++++++++++
> > kernel/sysctl.c | 9 ++++++
> > mm/memcontrol.c | 32 +++++++++++++++++++++-
> > mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
> > mm/vmstat.c | 2 +-
> > 10 files changed, 121 insertions(+), 35 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index a4482fceacec..c1dc63381f2c 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> > - percpu_pagelist_fraction
> > - stat_interval
> > - swappiness
> > +- lazyfreeness
> > - user_reserve_kbytes
> > - vfs_cache_pressure
> > - zone_reclaim_mode
> > @@ -737,6 +738,18 @@ The default value is 60.
> >
> > ==============================================================
> >
> > +lazyfreeness
> > +
> > +This control is used to define how aggressively the kernel will discard
> > +MADV_FREE hinted pages. Higher values will increase aggressiveness,
> > +lower values decrease the amount of discarding. A value of 0 instructs
> > +the kernel not to initiate discarding until the amount of free and
> > +file-backed pages is less than the high water mark in a zone.
> > +
> > +The default value is 20.
> > +
> > +==============================================================
> > +
> > - user_reserve_kbytes
> >
> > When overcommit_memory is set to 2, "never overcommit" mode, reserve
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index f7a1f2107b43..3b0bf1b78b2e 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > "Node %d Inactive(anon): %8lu kB\n"
> > "Node %d Active(file): %8lu kB\n"
> > "Node %d Inactive(file): %8lu kB\n"
> > - "Node %d Unevictable: %8lu kB\n"
> > "Node %d LazyFree: %8lu kB\n"
> > + "Node %d Unevictable: %8lu kB\n"
> > "Node %d Mlocked: %8lu kB\n",
> > nid, K(i.totalram),
> > nid, K(i.freeram),
> > @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> > nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> > nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> > - nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_LZFREE)),
> > + nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_MLOCK)));
> >
> > #ifdef CONFIG_HIGHMEM
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 3444f7c4e0b6..f47e6a5aa2e5 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > "Inactive(anon): %8lu kB\n"
> > "Active(file): %8lu kB\n"
> > "Inactive(file): %8lu kB\n"
> > - "Unevictable: %8lu kB\n"
> > "LazyFree: %8lu kB\n"
> > + "Unevictable: %8lu kB\n"
> > "Mlocked: %8lu kB\n"
> > #ifdef CONFIG_HIGHMEM
> > "HighTotal: %8lu kB\n"
> > @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > K(pages[LRU_INACTIVE_ANON]),
> > K(pages[LRU_ACTIVE_FILE]),
> > K(pages[LRU_INACTIVE_FILE]),
> > - K(pages[LRU_UNEVICTABLE]),
> > K(pages[LRU_LZFREE]),
> > + K(pages[LRU_UNEVICTABLE]),
> > K(global_page_state(NR_MLOCK)),
> > #ifdef CONFIG_HIGHMEM
> > K(i.totalhigh),
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3e3318ddfc0e..5522ff733506 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -210,6 +210,7 @@ struct mem_cgroup {
> > int under_oom;
> >
> > int swappiness;
> > + int lzfreeness;
> > /* OOM-Killer disable */
> > int oom_kill_disable;
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1aaa436da0d5..cca514a9701d 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -120,8 +120,8 @@ enum zone_stat_item {
> > NR_ACTIVE_ANON, /* " " " " " */
> > NR_INACTIVE_FILE, /* " " " " " */
> > NR_ACTIVE_FILE, /* " " " " " */
> > - NR_UNEVICTABLE, /* " " " " " */
> > NR_LZFREE, /* " " " " " */
> > + NR_UNEVICTABLE, /* " " " " " */
> > NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> > NR_ANON_PAGES, /* Mapped anonymous pages */
> > NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> > @@ -179,14 +179,15 @@ enum lru_list {
> > LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> > LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> > LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> > - LRU_UNEVICTABLE,
> > LRU_LZFREE,
> > + LRU_UNEVICTABLE,
> > NR_LRU_LISTS
> > };
> >
> > #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > -
> > -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_anon_file_lru(lru) \
> > + for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
> >
> > static inline int is_file_lru(enum lru_list lru)
> > {
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f0310eeab3ec..73bcdc9d0e88 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > unsigned long *nr_scanned);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > +extern int vm_lazyfreeness;
> > extern int remove_mapping(struct address_space *mapping, struct page *page);
> > extern unsigned long vm_total_pages;
> >
> > @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> > return memcg->swappiness;
> > }
> >
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> > +{
> > + /* root ? */
> > + if (mem_cgroup_disabled() || !memcg->css.parent)
> > + return vm_lazyfreeness;
> > +
> > + return memcg->lzfreeness;
> > +}
> > +
> > #else
> > static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> > {
> > return vm_swappiness;
> > }
> > +
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> > +{
> > + return vm_lazyfreeness;
> > +}
> > #endif
> > #ifdef CONFIG_MEMCG_SWAP
> > extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index e69201d8094e..2496b10c08e9 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> > .extra1 = &zero,
> > .extra2 = &one_hundred,
> > },
> > + {
> > + .procname = "lazyfreeness",
> > + .data = &vm_lazyfreeness,
> > + .maxlen = sizeof(vm_lazyfreeness),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = &zero,
> > + .extra2 = &one_hundred,
> > + },
> > #ifdef CONFIG_HUGETLB_PAGE
> > {
> > .procname = "nr_hugepages",
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1dc599ce1bcb..5bdbe2a20dc0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> > "active_anon",
> > "inactive_file",
> > "active_file",
> > - "unevictable",
> > "lazyfree",
> > + "unevictable",
> > };
> >
> > #define THRESHOLDS_EVENTS_TARGET 128
> > @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> > return 0;
> > }
> >
> > +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> > + struct cftype *cft)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + return mem_cgroup_lzfreeness(memcg);
> > +}
> > +
> > +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> > + struct cftype *cft, u64 val)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + if (val > 100)
> > + return -EINVAL;
> > +
> > + if (css->parent)
> > + memcg->lzfreeness = val;
> > + else
> > + vm_lazyfreeness = val;
> > +
> > + return 0;
> > +}
> > +
> > static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> > {
> > struct mem_cgroup_threshold_ary *t;
> > @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> > .write_u64 = mem_cgroup_swappiness_write,
> > },
> > {
> > + .name = "lazyfreeness",
> > + .read_u64 = mem_cgroup_lzfreeness_read,
> > + .write_u64 = mem_cgroup_lzfreeness_write,
> > + },
> > + {
> > .name = "move_charge_at_immigrate",
> > .read_u64 = mem_cgroup_move_charge_read,
> > .write_u64 = mem_cgroup_move_charge_write,
> > @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > memcg->use_hierarchy = parent->use_hierarchy;
> > memcg->oom_kill_disable = parent->oom_kill_disable;
> > memcg->swappiness = mem_cgroup_swappiness(parent);
> > + memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
> >
> > if (parent->use_hierarchy) {
> > page_counter_init(&memcg->memory, &parent->memory);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cd65db9d3004..f1abc8a6ca31 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -141,6 +141,10 @@ struct scan_control {
> > */
> > int vm_swappiness = 60;
> > /*
> > + * From 0 .. 100. Higher means more lazy freeing.
> > + */
> > +int vm_lazyfreeness = 20;
> > +/*
> > * The total number of pages which are beyond the high watermark within all
> > * zones.
> > */
> > @@ -2012,10 +2016,11 @@ enum scan_balance {
> > *
> > * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> > * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> > + * nr[4] = lazy free pages to scan;
> > */
> > static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > - struct scan_control *sc, unsigned long *nr,
> > - unsigned long *lru_pages)
> > + int lzfreeness, struct scan_control *sc,
> > + unsigned long *nr, unsigned long *lru_pages)
> > {
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > u64 fraction[2];
> > @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > struct zone *zone = lruvec_zone(lruvec);
> > unsigned long anon_prio, file_prio;
> > enum scan_balance scan_balance;
> > - unsigned long anon, file;
> > + unsigned long anon, file, lzfree;
> > bool force_scan = false;
> > unsigned long ap, fp;
> > enum lru_list lru;
> > bool some_scanned;
> > int pass;
> > + unsigned long scan_lzfree = 0;
> >
> > /*
> > * If the zone or memcg is small, nr[l] can be 0. This
> > @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > /* Only use force_scan on second pass. */
> > for (pass = 0; !some_scanned && pass < 2; pass++) {
> > *lru_pages = 0;
> > - for_each_evictable_lru(lru) {
> > + for_each_anon_file_lru(lru) {
> > int file = is_file_lru(lru);
> > unsigned long size;
> > unsigned long scan;
> > @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > some_scanned |= !!scan;
> > }
> > }
> > +
> > + lzfree = get_lru_size(lruvec, LRU_LZFREE);
> > + if (lzfree) {
> > + scan_lzfree = sc->nr_to_reclaim *
> > + (DEF_PRIORITY - sc->priority);
>
> scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?
> > + scan_lzfree = div64_u64(scan_lzfree *
> > + lzfreeness, 50);
> > + if (!scan_lzfree) {
> > + unsigned long zonefile, zonefree;
> > +
> > + zonefree = zone_page_state(zone, NR_FREE_PAGES);
> > + zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_FILE);
> > + if (unlikely(zonefile + zonefree <=
> > + high_wmark_pages(zone))) {
> > + scan_lzfree = get_lru_size(lruvec,
> > + LRU_LZFREE) >> sc->priority;
> > + }
> > + }
> > + }
> > +
> > + nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> > }
>
> Looks like there is no setting to reclaim only lazyfree pages. Could we have
> an option for this? It's legitimate that we don't want to thrash the page
> cache because of lazyfree memory.

Once we introduce the knob, that should be doable.
I will do it in the next spin.
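
For reference, the system-wide knob is set like any other vm sysctl; a
minimal userspace snippet (illustrative only; per-memcg groups would write
to memory.lazyfreeness instead):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a new value, e.g. "40", to /proc/sys/vm/lazyfreeness. */
static int set_lazyfreeness(const char *val)
{
        int fd = open("/proc/sys/vm/lazyfreeness", O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}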

Thanks for the review!