Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

From: Johannes Weiner
Date: Fri May 29 2020 - 11:13:02 EST


On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > *It would require another page flag to tell whether a refaulting cache
> > page has challenged the anon set once (transitioning) or repeatedly
> > (thrashing), as we currently use the active state for that. If we
> > would repurpose PG_workingset to tell the first from the second
> > refault, we'd need a new flag to mark a page for memstall accounting.
>
> I don't understand why a new flag is needed. Whenever we found that
> challenge is needed (dist < active + anon), we need to add up IO cost.

It sounds like this was cleared up later on in the email.

> > > It could cause thrashing for your patch.
> > > Without the patch, the current logic tries to find the hottest file
> > > pages that fit into the current file list size and protects them
> > > successfully. Assume that the access distance of 50 MB of hot file
> > > pages is 60 MB, which is less than the whole file list size but larger
> > > than the inactive list size. Without your patch, the 50 MB of hot
> > > pages are not evicted at all. All these hot pages will be protected
> > > from the 100 MB of low access frequency pages. The 100 MB of low
> > > access frequency pages will be refaulted repeatedly, but that's
> > > correct behaviour.
> >
> > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > with my patch?
>
> Thanks for the kind explanation. I read it all and found that I was
> confused before. Please let me correct the example.
>
> Environment:
> anon: 500 MB (hot) / 500 MB (hot)
> file: 50 MB (so hot) / 50 MB (dummy)
>
> I will call the 50 MB of hot file pages F(H).
> Let's assume that periodic access to another file (500 MB) starts. That
> file consists of 5 parts of 100 MB each. I will call them P1, P2, ..., P5.
>
> The problem occurs with the following access pattern.
>
> F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
>
> With this access pattern, access distance of F(H) and Pn is:
>
> F(H) = 150 MB
> Pn = 750 MB
>
> Without your patch, F(H) is kept in memory since deactivation would not
> happen. However, with your patch, deactivation happens since Pn's refault
> distance is less than 'active file + anon'. In the end, F(H) would be
> evicted.

Okay, this example makes sense to me.
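
For reference, the two access distances quoted above can be reproduced with a small user-space sketch. The helper names are hypothetical, and the counting rule is an assumption: every visit to F(H) between two accesses of the same page is counted, which is what yields the quoted 750 MB for Pn.

```c
/*
 * Hypothetical helpers, not kernel code. All sizes are in MB.
 * Pattern: F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> Pn -> F(H) -> P1 ...
 */

/* Data touched between two accesses of the same F(H) page:
 * the hot set itself plus the one part of P in between. */
unsigned long hot_access_distance_mb(unsigned long hot_mb,
				     unsigned long part_mb)
{
	return hot_mb + part_mb;
}

/* Data touched between two accesses of the same Pn page:
 * one full cycle over all parts, with an F(H) visit before each part. */
unsigned long part_access_distance_mb(unsigned long hot_mb,
				      unsigned long part_mb,
				      unsigned int nr_parts)
{
	return nr_parts * (hot_mb + part_mb);
}
```

With hot_mb = 50, part_mb = 100 and nr_parts = 5, the first helper gives 150 and the second gives 750.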

I do think P needs to challenge the workingset - at first. P could
easily fit into memory by itself if anon and active_file were cold, so
we need to challenge them to find out that they're hot. As you can
see, if anon and F(H) were completely unused, the current behavior
would be incorrect.

The current behavior would do the same in a cache-only example:

anon = 1G (unreclaimable)
file = 500M (hot) / 300M (dummy)

P = 400M

F(H) -> P1 -> F(H) -> P2 ...

If F(H) is already active, the first P refaults would have a distance
of 100M, thereby challenging F(H). As F(H) reactivates, its size will
be reflected in the refault distances, pushing them beyond the size of
memory that is available to the cache: 600M refault distance > 500M
active cache, or 900M access distance > 800M cache space.
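
The arithmetic in this cache-only example can be written out as a rough sketch. It assumes the usual relation that a refault distance is the access distance minus the time the page spent on the inactive list, and that a refault challenges the active list when its distance is no larger than the active size. The function names are made up for illustration and sizes are in MB.

```c
#include <stdbool.h>

/* Hypothetical helper: the part of the access distance that the page
 * spent out of memory, i.e. access distance minus inactive list size. */
unsigned long refault_distance_mb(unsigned long access_distance_mb,
				  unsigned long inactive_mb)
{
	if (access_distance_mb <= inactive_mb)
		return 0;	/* page would not have been evicted */
	return access_distance_mb - inactive_mb;
}

/* A refault challenges the active set if the page would have stayed
 * resident had the active list's memory been available to it. */
bool refault_challenges_active(unsigned long refault_mb,
			       unsigned long active_mb)
{
	return refault_mb <= active_mb;
}
```

While F(H) sits on the active list, P has an access distance of 400M against 300M inactive, so the refault distance is 100M and F(H) is challenged. Once F(H) reactivations are counted, the access distance is 900M, the refault distance 600M, and the check against the 500M active cache fails.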

However, I agree with your observation about the anon age below. When
we start aging the anon set, we have to reflect that in the refault
distances. Once we know that the 1G anon pages are indeed hotter than
the pages in P, there is no reason to keep churning the workingset.

> > The only way they could get reclaimed is if their access distance ends
> > up bigger than the file cache. But if that's the case, then the
> > workingset is overcommitted, and none of the pages qualify for reclaim
> > protection. Picking a subset to protect against the rest is arbitrary.
>
> In the fixed example, although the other file (500 MB) is repeatedly
> accessed, it's not part of the workingset. If we had a unified list
> (file + anon), the access distance of Pn would be larger than the whole
> memory size. Therefore, it's not an overcommitted workingset and this
> patch wrongly tries to activate it. As I said before, without considering
> inactive_age for the anon list, this calculation cannot be correct.

You're right. If we don't take anon age into account, the activations
could be over-eager; however, so would counting IO cost and exerting
pressure on anon be, which means my previous patch to split these two
wouldn't fix the fundamental problem you're pointing out. We simply
have to take anon age into account for the refaults to be comparable.

Once we do that, in your example, we would see activations in the
beginning in order to challenge the combined workingset (active_file +
anon) - which is legitimate as long as we don't know it's hot. And as
the anon pages are scanned and rotated (and the challenged F(H)
reactivated), the refault distances increase accordingly to reflect
the size of the hot pages sampled, which will correctly put P's
distances beyond the size of available memory.
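
To illustrate the idea, here is a minimal user-space model; it is not kernel code. The struct and function names are hypothetical, and it assumes a single age counter that advances both on evictions and whenever hot pages are sampled (anon rotations, file reactivations). A refault activates when its distance fits within active_file + anon.

```c
#include <stdbool.h>

/* Hypothetical model of one LRU domain. All sizes are in MB. */
struct ws_model {
	unsigned long age;		/* advances on evictions and hot samples */
	unsigned long active_file;
	unsigned long anon;
};

/* Evict 'mb' of cache; returns the age snapshot a shadow entry would hold. */
unsigned long ws_evict(struct ws_model *ws, unsigned long mb)
{
	unsigned long shadow = ws->age;

	ws->age += mb;
	return shadow;
}

/* Account 'mb' of pages found hot during aging (rotated or reactivated). */
void ws_age_hot(struct ws_model *ws, unsigned long mb)
{
	ws->age += mb;
}

/* Does a refault of the page behind 'shadow' challenge the workingset? */
bool ws_refault_activates(const struct ws_model *ws, unsigned long shadow)
{
	unsigned long distance = ws->age - shadow;

	return distance <= ws->active_file + ws->anon;
}
```

Plugging in the example (active_file = 50, anon = 1000): if only the five 100M evictions of P advance the age, a refaulting P1 page sees a distance of 500 and activates. If, as an assumption for illustration, the full 1000M of anon is rotated and F(H) is reactivated five times in the same cycle, the distance becomes 1750, which exceeds 1050, and the activations stop.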

However, your example cannot have a completely silent stable state. As
we stop workingset aging, the refault distances will slowly increase
again. We will always have a bit of churn, and rightfully so, because
the workingset *could* go stale.

That's the same situation in my cache-only example above. Anytime you
have a subset of pages that by itself could fit into memory, but can't
because of an established workingset, ongoing sampling is necessary.

But the rate definitely needs to reduce as we detect that in-memory
pages are indeed hot. Otherwise we cause more churn than is required
for an appropriate rate of workingset sampling.

How about the patch below? It looks correct, but I will have to re-run
my tests to make sure I / we are not missing anything.

---