Re: [PATCH -mm 07/25] second chance replacement for anonymous pages

From: Andrew Morton
Date: Fri Jun 06 2008 - 21:06:39 EST


On Fri, 06 Jun 2008 16:28:45 -0400
Rik van Riel <riel@xxxxxxxxxx> wrote:

> From: Rik van Riel <riel@xxxxxxxxxx>
>
> We avoid evicting and scanning anonymous pages for the most part, but
> under some workloads we can end up with most of memory filled with
> anonymous pages. At that point, we suddenly need to clear the referenced
> bits on all of memory, which can take ages on very large memory systems.
>
> We can reduce the maximum number of pages that need to be scanned by
> not taking the referenced state into account when deactivating an
> anonymous page. After all, every anonymous page starts out referenced,
> so why check?
>
> If an anonymous page gets referenced again before it reaches the end
> of the inactive list, we move it back to the active list.
>
> To keep the maximum amount of necessary work reasonable, we scale the
> active to inactive ratio with the size of memory, using the formula
> active:inactive ratio = sqrt(memory in GB * 10).

Should be scaled by PAGE_SIZE?

> Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
> instead of by the amount of memory present in the system.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>
> ---
> include/linux/mm_inline.h | 12 ++++++++++++
> include/linux/mmzone.h | 5 +++++
> mm/page_alloc.c | 40 ++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 38 +++++++++++++++++++++++++++++++-------
> mm/vmstat.c | 6 ++++--
> 5 files changed, 92 insertions(+), 9 deletions(-)
>
> Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-28 12:09:06.000000000 -0400
> @@ -97,4 +97,16 @@ del_page_from_lru(struct zone *zone, str
> __dec_zone_state(zone, NR_INACTIVE_ANON + l);
> }
>
> +static inline int inactive_anon_low(struct zone *zone)
> +{
> +	unsigned long active, inactive;
> +
> +	active = zone_page_state(zone, NR_ACTIVE_ANON);
> +	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	if (inactive * zone->inactive_ratio < active)
> +		return 1;
> +
> +	return 0;
> +}

inactive_anon_low: "number of inactive anonymous pages which are in lowmem"?

Nope.

Needs a comment. And maybe a better name, like inactive_anon_is_low.
Although making the return type a bool kind-of does that.
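
A minimal user-space sketch of the check under discussion, using the
suggested name inactive_anon_is_low and a bool return type. The
struct zone_counts type and its fields are hypothetical stand-ins for the
NR_ACTIVE_ANON / NR_INACTIVE_ANON counters and zone->inactive_ratio; this
is an illustration, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone state the patch reads. */
struct zone_counts {
	unsigned long nr_active_anon;	/* pages on the active anon list */
	unsigned long nr_inactive_anon;	/* pages on the inactive anon list */
	unsigned int inactive_ratio;	/* target active:inactive ratio, >= 1 */
};

/*
 * Return true when the inactive anon list is below its target size,
 * i.e. when inactive * inactive_ratio < active.
 */
static bool inactive_anon_is_low(const struct zone_counts *z)
{
	/* The quoted setup code never stores a ratio of 0. */
	assert(z->inactive_ratio >= 1);

	return z->nr_inactive_anon * z->inactive_ratio < z->nr_active_anon;
}
```

With a ratio of 3, 300 active pages and 99 inactive pages count as "low"
(297 < 300), while 100 inactive pages do not.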

> #endif
> Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:09:06.000000000 -0400
> @@ -311,6 +311,11 @@ struct zone {
> */
> int prev_priority;
>
> + /*
> + * The ratio of active to inactive pages.
> + */
> + unsigned int inactive_ratio;

That comment needs a lot of help please. For a start, it's plain wrong
- inactive_ratio would need to be a float to be able to record that ratio.

The comment should describe the units too.

Now poor-old-reviewer has to go off and work out what this thing is.

>
> ZONE_PADDING(_pad2_)
> /* Rarely used or read-mostly fields */
> Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 12:09:06.000000000 -0400
> @@ -4269,6 +4269,45 @@ void setup_per_zone_pages_min(void)
> calculate_totalreserve_pages();
> }
>
> +/**
> + * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
> + *
> + * The inactive anon list should be small enough that the VM never has to
> + * do too much work, but large enough that each inactive page has a chance
> + * to be referenced again before it is swapped out.
> + *
> + * The inactive_anon ratio is the ratio of active to inactive anonymous

target ratio? Desired ratio?

> + * pages. Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
> + * on the inactive list.
> + *
> + * total     return    max
> + * memory    value     inactive anon

This function doesn't "return" a "value".

> + * -------------------------------------
> + *   10MB       1         5MB
> + *  100MB       1        50MB
> + *    1GB       3       250MB
> + *   10GB      10       0.9GB
> + *  100GB      31         3GB
> + *    1TB     101        10GB
> + *   10TB     320        32GB
> + */
> +void setup_per_zone_inactive_ratio(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_zone(zone) {
> +		unsigned int gb, ratio;
> +
> +		/* Zone size in gigabytes */
> +		gb = zone->present_pages >> (30 - PAGE_SHIFT);
> +		ratio = int_sqrt(10 * gb);
> +		if (!ratio)
> +			ratio = 1;
> +
> +		zone->inactive_ratio = ratio;
> +	}
> +}

OK, so inactive_ratio is an integer 1 .. N which determines our target
number of inactive pages according to the formula

nr_inactive = nr_active / inactive_ratio

yes?

Can nr_inactive get larger than this? I assume so. I guess that
doesn't matter much. Except the problems which you're trying to solve
here can reoccur. What would I need to do to trigger that?
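
The arithmetic in the quoted loop can be checked outside the kernel. The
sketch below is a user-space approximation under stated assumptions:
SKETCH_PAGE_SHIFT is fixed at 12 (4 KiB pages), isqrt_floor() is a simple
stand-in for the kernel's int_sqrt(), and inactive_ratio_for() and
target_inactive_pages() are hypothetical helper names. It follows the
patch comment's reading that a ratio of N means 1/(N+1) of the anonymous
pages sit on the inactive list.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Floor of the square root; stands in for the kernel's int_sqrt(). */
static uint64_t isqrt_floor(uint64_t x)
{
	uint64_t r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Same arithmetic as the quoted loop body, applied to one zone. */
static unsigned int inactive_ratio_for(uint64_t present_pages)
{
	uint64_t gb = present_pages >> (30 - SKETCH_PAGE_SHIFT);
	uint64_t ratio = isqrt_floor(10 * gb);

	return ratio ? (unsigned int)ratio : 1;
}

/* Inactive list size, in pages, when the lists sit exactly at ratio:1. */
static uint64_t target_inactive_pages(uint64_t anon_pages, unsigned int ratio)
{
	assert(ratio >= 1);
	return anon_pages / (ratio + 1);
}
```

For a 1GB zone (262144 pages of 4 KiB) this gives a ratio of 3 and a
target of 65536 inactive pages, i.e. 256MB, which lines up with the
"250MB" row in the table. Zones smaller than 1GB round down to gb = 0 and
are clamped to a ratio of 1.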

> /*
> * Initialise min_free_kbytes.
> *
> @@ -4306,6 +4345,7 @@ static int __init init_per_zone_pages_mi
> min_free_kbytes = 65536;
> setup_per_zone_pages_min();
> setup_per_zone_lowmem_reserve();
> + setup_per_zone_inactive_ratio();
> return 0;
> }
> module_init(init_per_zone_pages_min)
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:11:38.000000000 -0400
> @@ -114,7 +114,7 @@ struct scan_control {
> /*
> * From 0 .. 100. Higher means more swappy.
> */
> -int vm_swappiness = 60;
> +int vm_swappiness = 20;

<goes back to check the changelog>

Whoa. Where'd this come from?

> long vm_total_pages; /* The total number of pages which the VM controls */
>
> static LIST_HEAD(shrinker_list);
> @@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
> static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> struct scan_control *sc, int priority, int file)
> {
> - unsigned long pgmoved;
> + unsigned long pgmoved = 0;
> int pgdeactivate = 0;
> unsigned long pgscanned;
> LIST_HEAD(l_hold); /* The pages which were snipped off */
> @@ -1036,17 +1036,32 @@ static void shrink_active_list(unsigned
> __mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
> spin_unlock_irq(&zone->lru_lock);
>
> + pgmoved = 0;

didn't we just do that?

>  	while (!list_empty(&l_hold)) {
>  		cond_resched();
>  		page = lru_to_page(&l_hold);
>  		list_del(&page->lru);
> -		if (page_referenced(page, 0, sc->mem_cgroup))
> -			list_add(&page->lru, &l_active);
> -		else
> +		if (page_referenced(page, 0, sc->mem_cgroup)) {
> +			if (file) {
> +				/* Referenced file pages stay active. */
> +				list_add(&page->lru, &l_active);
> +			} else {
> +				/* Anonymous pages always get deactivated. */

hm. That's going to make the machine swap like hell. I guess I don't
understand all this yet.

> +				list_add(&page->lru, &l_inactive);
> +				pgmoved++;
> +			}
> +		} else
>  			list_add(&page->lru, &l_inactive);
>  	}
>
> /*
> + * Count the referenced anon pages as rotated, to balance pageout
> + * scan pressure between file and anonymous pages in get_sacn_ratio.

tpyo
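
The per-page decision in the shrink_active_list() hunk above can be
sketched in isolation. This is a user-space illustration only: enum
lru_target, struct deactivate_result and classify_page() are hypothetical
names, and counted_as_rotated merely mirrors the pgmoved++ in the anon
branch. The quoted hunk is cut off before it shows how pgmoved is
consumed, so that part is not modelled.

```c
#include <stdbool.h>

/* Which list a page scanned off the active list ends up on. */
enum lru_target { TO_ACTIVE, TO_INACTIVE };

struct deactivate_result {
	enum lru_target target;
	bool counted_as_rotated;	/* mirrors pgmoved++ in the hunk */
};

/*
 * Decide the fate of one page taken from the active list.
 * @referenced: the page was referenced since the last scan
 * @file: the page is file-backed rather than anonymous
 */
static struct deactivate_result classify_page(bool referenced, bool file)
{
	struct deactivate_result r = { TO_INACTIVE, false };

	if (referenced) {
		if (file)
			r.target = TO_ACTIVE;		/* referenced file pages stay active */
		else
			r.counted_as_rotated = true;	/* anon: deactivated, but counted */
	}
	return r;
}
```

Under this rule the only pages that return to the active list from this
loop are referenced file pages; a referenced anonymous page gets its
second chance later, while it sits on the inactive list.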

