Re: Detecting page cache trashing state

From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
Date: Fri Oct 27 2017 - 16:30:39 EST


On 10/26/2017 06:53 AM, vinayak menon wrote:
On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) <rruslich@xxxxxxxxx> wrote:
Hi Johannes,

Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
version we currently support) and test it a bit.
The overall idea definitely looks promising, although I have one question
about usage: will it be able to account for the time processes spend
handling major page faults on refaulting pages (including filesystem and
iowait time)?

We have one big application whose code occupies a large amount of the page
cache. When the system is under heavy memory pressure and reclaims some of
it, the application starts thrashing constantly. Since its code is stored on
squashfs, it spends all of its CPU time decompressing pages, and the memdelay
counters do not seem to detect this situation.
Here are some counters that show this:

19:02:44     CPU     %user     %nice   %system   %iowait    %steal     %idle
19:02:45     all      0.00      0.00    100.00      0.00      0.00      0.00

19:02:44  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
19:02:45  15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00

And since nobody is actively allocating memory anymore, the memdelay
counters do not seem to be incremented:

[:~]$ cat /proc/memdelay
268035776
6.13 5.43 3.58
1.90 1.89 1.26

Just in case, I have attached the patch rebased onto v4.9.26.

It looks like this 4.9 version does not contain the accounting in lock_page().

In v4.9 there is no wait_on_page_bit_common(), so the accounting was moved
into wait_on_page_bit(), wait_on_page_bit_killable() and
wait_on_page_bit_killable_timeout().
The related functionality around lock_page_or_retry() seems to be mostly the
same in v4.9.