Re: Sudden and massive page cache eviction

From: Peter Schüller
Date: Wed Nov 24 2010 - 09:02:38 EST


Hello,

first of all, thank you very much for taking the time to analyze the situation!

> Yeah, drop_caches doesn't seem very likely.
>
> Your postgres data looks the cleanest and is probably the easiest to
> analyze. Might as well start there:
>
>     http://files.spotify.com/memcut/postgresql_weekly.png

Since you wanted to look at that primarily I re-visited that
particular case to confirm what was really happening. Unfortunately I
have to retract my claim here, because it turns out that we have
backups running locally on the machine before shipping them away, and
it seems that indeed the cache evictions on that machine are
correlated with the removal of said backup after it was shipped away
(and testing confirms the behavior).

Of course that is entirely expected (that removal of a recently
written file will cause a sudden spike in free memory) so the
PostgreSQL graph is a red herring.

This was unfortunate, and a result of me picking this guy fairly
ad-hoc for the purpose of summarizing the situation in my post to the
list. We have spent considerable time trying to make sure that the
evictions are indeed anomalous and not e.g. due to a file removal in
the case of the actual service where we are negatively affected, but I
was not sufficiently careful before proclaiming that we seem to see
similar effects on other hosts.

It may still be the case, but I am not finding a case which is
sufficiently clear at this time (being sure requires really looking at
what is going on with each class of machine and eliminating various
forms of backups, log rotation and other behavior exhibited). This
also means that in general it's not certain that we are in fact seeing
this behavior on others at all.

However it does leave all other observations, including the very
direct correlation in time with load spikes and a lack of correlation
with backup jobs, and the fact that the eviction seems to be of
actively used data given the resulting I/O storm. So I feel confident
in saying that we definitely do have an actual issue (although as
previously indicated we have not proven conclusively that there is
absolutely no userspace application allocating and touching lots of
pages suddenly, but it seems very unlikely).

> Just eyeballing it, _most_ of the evictions seem to happen after some
> movement in the active/inactive lists. We see an "inactive" uptick as
> we start to launder pages, and the page activation doesn't keep up with
> it. This is a _bit_ weird since we don't see any slab cache or other
> users coming to fill the new space. Something _wanted_ the memory, so
> why isn't it being used?

In this case we have the writing of a backup file (with corresponding
page touching for reading data). This is followed by a period of
reading the recently written file, followed by it being removed.

> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
> allocations happening in the kernel?

No; we're not using huge pages at all (not consciously). Looking at
/proc/meminfo I can confirm that we just see this:

HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

So hopefully that should not be a factor.
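
For checking this on many hosts, a minimal sketch of how the huge page
counters could be read programmatically. The helper names
(parse_meminfo, hugepages_in_use) are made up for illustration, and the
sample text is the output quoted above; it falls back to that sample if
/proc/meminfo is not readable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key: value [kB]' lines into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def hugepages_in_use(fields):
    """Number of explicitly reserved huge pages currently handed out."""
    return fields.get("HugePages_Total", 0) - fields.get("HugePages_Free", 0)


# The values quoted in this mail, used as a fallback and for testing.
SAMPLE = """\
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    info = parse_meminfo(text)
    print("HugePages_Total:", info.get("HugePages_Total", 0))
    print("huge pages in use:", hugepages_in_use(info))
```

A HugePages_Total of 0 means no huge page pool is configured, so the
in-use count is 0 as well.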

> If you could start recording /proc/{vmstat,buddystat,meminfo,slabinfo},
> it would be immensely useful. The munin graphs are really great, but
> they don't have the detail which you can get from stuff like vmstat.

Absolutely. I'll get some recording of those going and run for
sufficient duration to correlate with page evictions.
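
Roughly what I have in mind for the recording is sketched below. This is
an illustration rather than what will actually be deployed: the function
names, the output format and the interval are my own choices, and I am
assuming that "buddystat" refers to /proc/buddyinfo. Note that
/proc/slabinfo is typically readable only by root; unreadable files are
recorded as such instead of aborting the run.

```python
import time

# Assumption: "buddystat" in the request is taken to mean /proc/buddyinfo.
DEFAULT_PATHS = [
    "/proc/vmstat",
    "/proc/buddyinfo",
    "/proc/meminfo",
    "/proc/slabinfo",
]


def snapshot(paths, timestamp):
    """Return one text record holding the contents of every path."""
    chunks = []
    for path in paths:
        try:
            with open(path) as f:
                body = f.read()
        except OSError as exc:
            body = "unreadable: %s\n" % exc
        if not body.endswith("\n"):
            body += "\n"
        chunks.append("=== %s @ %.3f ===\n%s" % (path, timestamp, body))
    return "".join(chunks)


def record(paths, out_path, interval, count):
    """Append `count` snapshots to out_path, `interval` seconds apart."""
    with open(out_path, "a") as out:
        for i in range(count):
            out.write(snapshot(paths, time.time()))
            out.flush()  # keep the log usable if the host stalls under I/O
            if i + 1 < count:
                time.sleep(interval)


if __name__ == "__main__":
    # Print a single snapshot; a real run would call record() with a
    # long-lived output file and a suitable interval.
    print(snapshot(DEFAULT_PATHS, time.time()), end="")
```

Each record carries a timestamp so that it can be lined up against the
munin graphs and the observed load spikes.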

> For a page-cache-heavy workload where you care a lot more about things
> being _in_ cache rather than having good NUMA locality, you probably
> want "zone_reclaim_mode" set to 0:
>
>     http://www.kernel.org/doc/Documentation/sysctl/vm.txt
>
> That'll be a bit more comprehensive than messing with numactl. It
> really is the best thing if you just don't care about NUMA latencies all
> that much.

Thanks! That looks to be exactly what we would like in this case and,
if I interpret you and the documentation correctly, obviates the need to
ask for interleaved allocation.
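
To audit the current setting across machines before changing it,
something like the following sketch could be used. The flag meanings
are taken from the vm.txt document linked above; the function names are
hypothetical.

```python
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"

# Bit meanings as documented in Documentation/sysctl/vm.txt.
FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value):
    """Return the descriptions of all bits set in value (empty if 0)."""
    return [name for bit, name in sorted(FLAGS.items()) if value & bit]


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Read the current zone_reclaim_mode as an integer."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        mode = read_zone_reclaim_mode()
    except (OSError, ValueError) as exc:
        print("could not read %s: %s" % (ZONE_RECLAIM_PATH, exc))
    else:
        flags = describe_zone_reclaim_mode(mode)
        print("zone_reclaim_mode = %d (%s)" % (mode, ", ".join(flags) or "disabled"))
```

Setting the value to 0 is then a matter of running
"sysctl -w vm.zone_reclaim_mode=0" as root, or writing 0 to the file
above.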

> What kind of hardware is this, btw?

It varies somewhat in age, but all of them are Intel. The oldest ones
have 16 GB of memory and are of this CPU type:

cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
stepping : 6
cpu MHz : 2494.168
cache size : 6144 KB

While newer ones have ~36 GB memory and are of this CPU type:

cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
stepping : 5
cpu MHz : 2000.049
cache size : 4096 KB

Some variation beyond that may exist, but that is the span (and all
are Intel, 8 cores or more).

numactl --show on older machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0

And on newer machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1

--
/ Peter Schuller aka scode