Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages

From: Michal Hocko
Date: Mon Feb 27 2017 - 14:54:39 EST


On Mon 27-02-17 12:06:34, Johannes Weiner wrote:
> On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
> > [...]
> > > >From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Date: Fri, 24 Feb 2017 10:56:32 -0500
> > > Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes
> > >
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > >
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > >
> > > top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
> > > Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > > KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
> > > KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
> > >
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > >
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > >
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > >
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> >
> > Yes this looks, nice&simple. I would just be worried about [1] a bit.
> > Maybe that is worth a separate patch though.
> >
> > [1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3quxwz@xxxxxxxxxxxxxx
>
> I think I'd prefer the simplicity of keeping this contained inside
> vmscan.c, as an interaction between direct reclaimers and kswapd, as
> well as leaving the wakeup tied to actually seeing reclaimable pages
> rather than merely producing free pages (e.g. should we also add a
> kick to a large munmap() for example?).

OK, that is a good point as well. I was about to argue that a mlock
runaway process killed by the OOM killer should restart kswapd otherwise
the following operation would be quite surprising. But you are right
that there are other sources of large amout of free pages. So you are
right, let's keep it simple for now and do something based on freed
pages.

--
Michal Hocko
SUSE Labs