[PATCH 3/3] vmscan: decrease pages_scanned on unevictable page

From: Minchan Kim
Date: Tue May 24 2011 - 18:09:17 EST


If there are many unevictable pages on the evictable LRU lists (e.g. a big ramfs),
shrink_page_list moves them to the unevictable list and cannot reclaim any pages,
but zone->pages_scanned has already been increased.
If this situation repeats, the number of evictable LRU pages keeps decreasing
while zone->pages_scanned keeps increasing without any page being reclaimed.
That can turn on zone->all_unreclaimable, which is a totally false alarm.
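
For reference, the heuristic this counter feeds looks roughly like the
following in vmscan.c of this era (paraphrased from memory, so the exact
guards may differ): kswapd declares a zone all_unreclaimable once
pages_scanned grows past six times the reclaimable pages.

static bool zone_reclaimable(struct zone *zone)
{
	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}

	/* in balance_pgdat(), after shrinking a zone: */
	if (nr_slab == 0 && !zone_reclaimable(zone))
		zone->all_unreclaimable = 1;

So every unevictable page that gets scanned but only moved, never reclaimed,
pushes the zone toward that threshold; the hunk below subtracts those pages
back out of pages_scanned.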

Signed-off-by: Minchan Kim <minchan.kim@xxxxxxxxx>
---
mm/vmscan.c | 22 +++++++++++++++++++---
1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 08d3077..a7df813 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -700,7 +700,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
- unsigned long *dirty_pages)
+ unsigned long *dirty_pages,
+ unsigned long *unevictable_pages)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -708,6 +709,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
+ unsigned long nr_unevictable = 0;

cond_resched();

@@ -908,6 +910,7 @@ cull_mlocked:
try_to_free_swap(page);
unlock_page(page);
putback_lru_page(page);
+ nr_unevictable++;
continue;

activate_locked:
@@ -936,6 +939,7 @@ keep_lumpy:
zone_set_flag(zone, ZONE_CONGESTED);

*dirty_pages = nr_dirty;
+ *unevictable_pages = nr_unevictable;
free_page_list(&free_pages);

list_splice(&ret_pages, page_list);
@@ -1372,6 +1376,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_dirty;
+ unsigned long nr_unevictable;
unsigned long nr_taken;
unsigned long nr_anon;
unsigned long nr_file;
@@ -1425,7 +1430,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);

reclaim_mode = sc->reclaim_mode;
- nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty,
+ &nr_unevictable);

/* Check if we should syncronously wait for writeback */
if ((nr_dirty && !(reclaim_mode & RECLAIM_MODE_SINGLE) &&
@@ -1434,7 +1439,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_active = clear_active_flags(&page_list, NULL);
count_vm_events(PGDEACTIVATE, nr_active);
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc, &nr_dirty);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+ &nr_dirty, &nr_unevictable);
}

local_irq_disable();
@@ -1442,6 +1448,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

+ /*
+ * Too many unevictable pages on the evictable LRU lists (e.g. a big
+ * ramfs) can inflate zone->pages_scanned while the number of pages
+ * left on the evictable LRUs shrinks as reclaim goes on.
+ * That could turn on all_unreclaimable, which is a false alarm.
+ */
+ spin_lock(&zone->lru_lock);
+ if (zone->pages_scanned >= nr_unevictable)
+ zone->pages_scanned -= nr_unevictable;
+ else
+ zone->pages_scanned = 0;
+ spin_unlock(&zone->lru_lock);
+
putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);

trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
--
1.7.1

===

The second thing I suspect is zone_set_flag(zone, ZONE_CONGESTED).
He used swap on an encrypted device-mapper target. Device mapper can make
I/O slow for his workload, which means we are more likely to hit
ZONE_CONGESTED than with a normal swap device.
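
For context, the tagging in shrink_page_list() is keyed off the dirty and
congested counters it already tracks; roughly like this (paraphrased, the
exact condition varies between versions):

	/*
	 * Tag the zone as congested if every dirty page encountered was
	 * backed by a congested BDI, so reclaimers back off and wait for
	 * the congestion to clear.
	 */
	if (nr_dirty && nr_dirty == nr_congested)
		zone_set_flag(zone, ZONE_CONGESTED);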

Let's think it through.
The swap device is very congested, so shrink_page_list would mark the zone
as CONGESTED.
Who clears ZONE_CONGESTED? There are two places in kswapd.
One only runs for order > 0, so it is probably a no-op in Andy's
workload (i.e. it's mostly order-0 allocations).
The remaining one is below.

/*
* If a zone reaches its high watermark,
* consider it to be no longer congested. It's
* possible there are dirty pages backed by
* congested BDIs but as pressure is relieved,
* spectulatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
if (i <= *classzone_idx)
balanced += zone->present_pages;

It only runs once the zone reaches its high watermark. If allocation is
faster than reclaim (which is true for a slow swap device), the zone
remains congested.
That means swapout keeps blocking.
From the OOM log we can see that the DMA32 zone cannot meet its high watermark.
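
The blocking comes from the throttle keyed off that flag. As a rough sketch
of the direct reclaim path of this era (not the exact code; preferred_zone
stands for the allocation's preferred zone), do_try_to_free_pages() naps
along these lines, and wait_iff_congested() only skips the sleep once the
zone is no longer flagged congested:

	/* Take a nap and wait for some writeback to complete */
	if (!sc->hibernation_mode && sc->nr_scanned &&
	    priority < DEF_PRIORITY - 2)
		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);

With a permanently congested zone, every pass through that check stalls,
so swapout to the slow device keeps blocking.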

Does my guess make sense?


--
Kind regards,
Minchan Kim