Re: [GIT PULL] Memory folios for v5.15

From: Johannes Weiner
Date: Tue Aug 24 2021 - 14:31:26 EST


On Mon, Aug 23, 2021 at 11:15:48PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> > However, this far exceeds the goal of a better mm-fs interface. And
> > the value proposition of a full MM-internal conversion, including
> > e.g. the less exposed anon page handling, is much more nebulous. It's
> > been proposed to leave anon pages out, but IMO to keep that direction
> > maintainable, the folio would have to be translated to a page quite
> > early when entering MM code, rather than propagating it inward, in
> > order to avoid huge, massively overlapping page and folio APIs.
>
> I only intend to leave anonymous memory out /for now/. My hope is
> that somebody else decides to work on it (and indeed Google have
> volunteered someone for the task).

Unlike the filesystem side, this seems like a lot of churn for very
little tangible value. And leaves us with an end result that nobody
appears to be terribly excited about.

But the folio abstraction is too low-level to use JUST for file cache
and NOT for anon. It's too close to the page layer itself and would
duplicate too much of it to be maintainable side by side.

That's why I asked why it couldn't be a more abstract memory unit for
managing file cache. With a clearer delineation between that and how
the backing memory is implemented - 1 page, N pages, maybe just a part
of a page later on. And not just be a different name for a head page.

It appears David is asking the same in the parallel subthread.

> > It's also not clear to me that using the same abstraction for compound
> > pages and the file cache object is future proof. It's evident from
> > scalability issues in the allocator, reclaim, compaction, etc. that
> > with current memory sizes and IO devices, we're hitting the limits of
> > efficiently managing memory in 4k base pages per default. It's also
> > clear that we'll continue to have a need for 4k cache granularity for
> > quite a few workloads that work with large numbers of small files. I'm
> > not sure how this could be resolved other than divorcing the idea of a
> > (larger) base page from the idea of cache entries that can correspond,
> > if necessary, to memory chunks smaller than a default page.
>
> That sounds to me exactly like folios, except for the naming.

Then I think you misunderstood me.

> From the MM point of view, it's less churn to do it your way, but
> from the point of view of the rest of the kernel, there's going to
> be unexpected consequences. For example, btrfs didn't support page
> size != block size until just recently (and I'm not sure it's
> entirely fixed yet?)
>
> And there's nobody working on your idea. At least not that have surfaced
> so far. The folio patch is here now.
>
> Folios are also variable sized. For files which are small, we still only
> allocate 4kB to cache them. If the file is accessed entirely randomly,
> we only allocate 4kB chunks at a time. We only allocate larger folios
> when we think there is an advantage to doing so.
>
> This benefit is retained if someone does come along to change PAGE_SIZE
> to 16KiB (or whatever). Folios can still be composed of multiple pages,
> no matter what the PAGE_SIZE is.

The folio doc says "It is at least as large as %PAGE_SIZE";
folio_order() says "A folio is composed of 2^order pages";
page_folio(), folio_pfn(), folio_nr_pages() all encode an N:1
relationship. And yes, the name implies it too.
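
To make the N:1 point concrete, here is a minimal user-space sketch. It
is not kernel code: the toy_* names, the 4 KiB base page and the
alignment-based lookup are assumptions for illustration only. A unit
defined as 2^order pages can never be smaller than one base page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB base pages. */
#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical stand-in for a folio: a naturally aligned run of base pages. */
struct toy_folio {
	unsigned long first_pfn;	/* page frame number of the first page */
	unsigned int order;		/* the unit spans 2^order base pages */
};

/* Number of base pages covered; the minimum (order 0) is exactly one. */
static inline unsigned long toy_folio_nr_pages(const struct toy_folio *f)
{
	return 1UL << f->order;
}

/* Size in bytes; never below TOY_PAGE_SIZE, whatever the order. */
static inline size_t toy_folio_size(const struct toy_folio *f)
{
	return (size_t)toy_folio_nr_pages(f) * TOY_PAGE_SIZE;
}

/*
 * N:1 lookup: every pfn inside an aligned 2^order run maps back to the
 * same first pfn. There is no way to express "part of a page" here.
 */
static inline unsigned long toy_pfn_to_folio_start(unsigned long pfn,
						   unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return pfn & ~((1UL << order) - 1);
}
```

In this model a 1 KiB cache entry inside a larger base page simply has
no representation, which is the conflict described next.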

This is in direct conflict with what I'm talking about, where base
page granularity could become coarser than file cache granularity.

Are we going to bump struct page to 2M soon? I don't know. Here is
what I do know about 4k pages, though:

- It's a lot of transactional overhead to manage tens of gigs of
  memory in 4k pages. We're reclaiming, paging and swapping more than
  ever before in our DCs, because flash provides in abundance the
  low-latency IOPS required for that, and parking cold/warm workload
  memory on cheap flash saves expensive RAM. But we're continuously
  scanning thousands of pages per second to do this. There was also
  the RWF_UNCACHED thread around reclaim CPU overhead at the higher
  end of buffered IO rates. There is the fact that we have a pending
  proposal from Google to replace rmap because it's too CPU-intense
  when paging into compressed memory pools.

- It's a lot of internal fragmentation. Compaction is becoming the
  default method for allocating the majority of memory in our
  servers. This is a latency concern during page faults, and a
  predictability concern when we defer it to khugepaged collapsing.

- struct page is statically eating gigs of expensive memory on every
  single machine, when only some of our workloads would require this
  level of granularity for some of their memory. And that's *after*
  we're fighting over every bit in that structure.

Base page size becoming bigger than cache entries in the near future
doesn't strike me as an exotic idea. The writing seems to be on the
wall. But the folio appears full of assumptions that conflict with it.

Sure, the patch is here now. But how much time will all the churn buy
us before we may need a do-over? Would clean, incremental changes to
the cache entry abstraction even be possible after we have anon and
all kinds of other compound page internals hanging off of it as well?

Wouldn't it make more sense to decouple filesystems from "paginess",
as David puts it, now instead? Avoid the risk of doing it twice, avoid
the more questionable churn inside mm code, avoid the confusing
proximity to the page and its API in the long-term...