Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Michal Hocko
Date: Thu Jan 19 2017 - 06:44:01 EST


On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
>
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
>
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
>
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages to be perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
>
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
>
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and may go through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.

I cannot really say I like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve
my biggest worry, though, which is that we go OOM with too many pages
isolated, as your patch doesn't alter zone_reclaimable_pages to reflect
those numbers.
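
To make that worry concrete, here is a toy userspace model, not kernel
code. The struct, its field names and the retry rule are hypothetical and
only illustrate how an estimate that ignores isolated pages can conclude
that nothing is left to reclaim while many pages are merely off the LRU.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters; names are illustrative only. */
struct zone_counts {
	unsigned long active;
	unsigned long inactive;
	unsigned long isolated;	/* pages temporarily taken off the LRU */
};

/* Estimate that only looks at pages currently on the LRU lists. */
static unsigned long reclaimable_lru_only(const struct zone_counts *z)
{
	return z->active + z->inactive;
}

/* Estimate that also counts pages which are isolated right now. */
static unsigned long reclaimable_with_isolated(const struct zone_counts *z)
{
	return z->active + z->inactive + z->isolated;
}

/*
 * Toy retry rule (an assumption for this sketch): retrying is worthwhile
 * as long as reclaimable plus free pages could still meet the watermark.
 */
static bool worth_retrying(unsigned long reclaimable,
			   unsigned long free_pages,
			   unsigned long watermark)
{
	return reclaimable + free_pages >= watermark;
}
```

With 10 active, 20 inactive and 500 isolated pages, 5 free pages and a
watermark of 100, the first estimate says a retry is pointless while the
second one says it is still worthwhile.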

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than
congestion_wait, because with congestion_wait we are waiting for a
different event than the API is intended for. On the other hand, a mere
schedule_timeout_interruptible might work equally well in real life.
Then again, I might be overemphasising the role of the NR_ISOLATED*
counts. It might turn out that we can safely ignore them and it
won't be the end of the world. So I would propose the following as a
starting point. If we ever see oom reports with a high number of
NR_ISOLATED* pages (those counters are part of the oom report), then we
know we have to do something about that. Those changes would at least be
driven by a real use case rather than theoretical scenarios.
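
For comparison, the "wake the waiters to recheck, but give up after a
timeout" semantics from the quoted changelog can be modelled in userspace
roughly as below. This is a pthread analogy under my own assumptions, not
the kernel waitqueue API; isolation_gate, gate_putback and gate_wait are
hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for an isolated-pages counter. */
struct isolation_gate {
	pthread_mutex_t lock;
	pthread_cond_t waiters;
	unsigned long isolated;	/* pages currently isolated */
	unsigned long limit;	/* "too many isolated" threshold */
};

static void gate_init(struct isolation_gate *g, unsigned long limit)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->waiters, NULL);
	g->isolated = 0;
	g->limit = limit;
}

/* A reclaimer puts pages back: drop the count and wake waiters to recheck. */
static void gate_putback(struct isolation_gate *g, unsigned long nr)
{
	pthread_mutex_lock(&g->lock);
	g->isolated = nr > g->isolated ? 0 : g->isolated - nr;
	pthread_cond_broadcast(&g->waiters);
	pthread_mutex_unlock(&g->lock);
}

/*
 * Sleep until isolated <= limit or until timeout_ms has elapsed.
 * Returns true if the gate opened, false if the wait timed out.
 */
static bool gate_wait(struct isolation_gate *g, long timeout_ms)
{
	struct timespec deadline;
	bool opened;

	/* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&g->lock);
	while (g->isolated > g->limit) {
		if (pthread_cond_timedwait(&g->waiters, &g->lock,
					   &deadline) == ETIMEDOUT)
			break;
	}
	opened = g->isolated <= g->limit;
	pthread_mutex_unlock(&g->lock);
	return opened;
}
```

A waiter that nobody wakes simply sleeps for the full timeout, which is
what a plain timed sleep would do in every case. The condition variable
only buys an earlier return when some other path calls gate_putback.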

So what do you think about the following? Tetsuo, would you be willing
to run this patch through your torture testing please?
---