Re: [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap

From: Yu Zhao
Date: Thu Apr 07 2022 - 19:52:09 EST


On Wed, Apr 6, 2022 at 9:46 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Thu, Apr 7, 2022 at 3:04 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 6, 2022 at 8:29 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > >
> > > > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > > > and clear the accessed bit) can be expensive because pages from
> > > > different VMAs (PA space) are not cache friendly to the rmap (VA
> > > > space). For workloads mostly using mapped pages, the rmap has a high
> > > > CPU cost in the reclaim path.
> > > >
> > > > This patch exploits spatial locality to reduce the trips into the
> > > > rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> > > > new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> > > > adjacent PTEs. On finding another young PTE, it clears the accessed
> > > > bit and updates the gen counter of the page mapped by this PTE to
> > > > (max_seq%MAX_NR_GENS)+1.
> > >
> > > Hi Yu,
> > > This seems an interesting way to save the cost of rmap, but could it lead to
> > > cold pages being judged as hot pages?
> > > Suppose a page is mapped by 20 processes and has been accessed
> > > by 5 of them. When we look around in one of the 5 processes, the page
> > > is found young and that PTE is cleared, but 4 PTEs are still not
> > > cleared. If we then don't access the page for a long time, the 4 uncleared
> > > PTEs will still make the page look "hot": we will find
> > > the page hot either by looking around in the 4 processes or by rmapping
> > > the page later?
> >
> > Why are the remaining 4 accessed PTEs skipped? The rmap should check
> > all the 20 PTEs.
>
> For example, page A is the neighbour of page B in process 1. When we do rmap
> for B, we look around and clear A's PTE in process 1, but A's PTEs are
> still set in processes 2, 3, 4 and 5.

It makes no difference in practice because the effect is too
insignificant. The goal is not to give several million pages unique
timestamps and sort them; it's to partition pages on the order of one
tenth of a second to a few seconds and quickly find some reasonable
candidates. Temporal locality gets weaker
exponentially over time. Even on small systems, the difference is not
measurable if several thousand pages used in the last few seconds are
chosen over another several thousand pages used in the last minute.
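
For readers following the thread, the look-around step from the quoted
patch description can be sketched in user space roughly as below. This is
only an illustration, not the kernel code: struct toy_pte,
toy_look_around() and the aligned-window choice are hypothetical, and the
value of MAX_NR_GENS is assumed. What is taken from the description is
that at most BITS_PER_LONG-1 neighbours are scanned, and that each young
neighbour has its accessed bit cleared and its gen counter set to
(max_seq%MAX_NR_GENS)+1.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NR_GENS	4UL	/* assumed value, for illustration only */
#define BITS_PER_LONG	(sizeof(long) * CHAR_BIT)

/* Hypothetical stand-in for a PTE plus the page it maps. */
struct toy_pte {
	bool young;		/* models the accessed bit */
	unsigned long gen;	/* 0 = untouched, else (seq % MAX_NR_GENS) + 1 */
};

/*
 * Called after the rmap walk found a young PTE at index @hit. Scans the
 * BITS_PER_LONG-sized aligned window containing @hit (an assumption of
 * this sketch), skipping @hit itself, so at most BITS_PER_LONG-1
 * neighbours are visited. Returns how many young neighbours were found.
 */
static size_t toy_look_around(struct toy_pte *ptes, size_t nr, size_t hit,
			      unsigned long max_seq)
{
	size_t start, end, i, found = 0;

	assert(hit < nr);

	start = hit - hit % BITS_PER_LONG;
	end = start + BITS_PER_LONG;
	if (end > nr)
		end = nr;

	for (i = start; i < end; i++) {
		if (i == hit || !ptes[i].young)
			continue;
		/* Clear the accessed bit and move the page to the youngest gen. */
		ptes[i].young = false;
		ptes[i].gen = max_seq % MAX_NR_GENS + 1;
		found++;
	}

	return found;
}
```

Note that the sketch only touches PTEs in the one page table being
walked. A page's PTEs in other processes keep their accessed bits, which
is the case Barry describes above.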