Re: Silent hang up caused by pages being not scanned?

From: Michal Hocko
Date: Tue Oct 13 2015 - 09:33:15 EST


On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange, the values printed by this debug printk() patch did not
> change as time went by. Thus, I think that this is not a problem of lack of
> CPU time for scanning pages. I suspect that there is a bug that nobody is
> scanning pages.
>
> ----------
> [ 66.821450] zone_reclaimable returned 1 at line 2646
> [ 66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [ 66.824935] shrink_zones returned 1 at line 2706
> [ 66.826392] zones_reclaimable=1 at line 2765
> [ 66.827865] do_try_to_free_pages returned 1 at line 2938
> [ 67.102322] __perform_reclaim returned 1 at line 2854
> [ 67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [ 281.439977] zone_reclaimable returned 1 at line 2646
> [ 281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [ 281.439978] shrink_zones returned 1 at line 2706
> [ 281.439978] zones_reclaimable=1 at line 2765
> [ 281.439979] do_try_to_free_pages returned 1 at line 2938
> [ 281.439979] __perform_reclaim returned 1 at line 2854
> [ 281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this small
we should eventually scan them enough times to make zone_reclaimable
fail. PAGES_SCANNED in your logs seems to be constant, though, which
suggests that somebody manages to free a page every time before we get
down to priority 0 and finally manage to scan something. This is pretty
much pathological behavior and I have a hard time imagining how that
would be possible, but it clearly shows that the zone_reclaimable
heuristic is not working properly.
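
To make the arithmetic in the quoted log concrete, here is a minimal
userspace sketch of the check as the debug output prints it. It is not
the kernel code: struct zone_counters and all three helpers are
hypothetical names made up for illustration, and the reset in
note_page_freed() is an assumption modelling the "somebody frees a page"
theory above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone counters shown in the debug log. */
struct zone_counters {
	unsigned long active_file;
	unsigned long inactive_file;
	unsigned long pages_scanned;
};

/*
 * Model of the check as printed in the log:
 * (ACTIVE_FILE + INACTIVE_FILE) * 6 > PAGES_SCANNED
 */
static bool zone_reclaimable_model(const struct zone_counters *z)
{
	unsigned long reclaimable = z->active_file + z->inactive_file;

	return z->pages_scanned < reclaimable * 6;
}

/* How many more pages must be scanned before the check starts to fail. */
static unsigned long scans_until_unreclaimable(const struct zone_counters *z)
{
	unsigned long limit = (z->active_file + z->inactive_file) * 6;

	return z->pages_scanned < limit ? limit - z->pages_scanned : 0;
}

/*
 * Assumption: freeing a page clears the scanned counter, so the zone
 * looks reclaimable again no matter how much was scanned before.
 */
static void note_page_freed(struct zone_counters *z)
{
	z->pages_scanned = 0;
}
```

With the logged values (26, 10, 32) the limit is 36 * 6 = 216, so another
184 pages would have to be scanned, with no reset in between, before the
check fails.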

I can see two options here. Either we teach zone_reclaimable to be less
fragile or we remove zone_reclaimable from shrink_zones altogether. Both
of them are risky because we have a long history of changes in this area
which caused other subtle behavior changes, but I guess that the first
option should be less fragile. What about the following patch? I am not
happy about it because the condition is rather rough and a deeper
inspection is really needed to check all the call sites, but it should be
good enough for testing.
---