Re: What size anonymous folios should we allocate?

From: Kent Overstreet
Date: Tue Feb 21 2023 - 22:09:16 EST


On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > In a sense this question is premature, because we don't have any code
> > in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> > but let's pretend that code already exists and is just waiting for us
> > to answer this policy question.
> >
> > I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> > option and 3. a sysfs tunable. It is foolish to expect the distro
> > packager or the sysadmin to be able to make such a decision. The
> > correct decision will depend upon the instantaneous workload of the
> > entire machine and we'll want different answers for different VMAs.
>
> Yeah, I agree those 3 options should be avoided. For some
> architectures, there are one or more sweet-spot sizes that benefit from
> hardware. For example, ARM64 contiguous PTE supports up to 16
> consecutive 4K pages to form a 64K entry in TLB instead of 16 4K
> entries. Some implementations may support intermediate sizes (for
> example, 8K, 16K and 32K, but this may make the hardware design
> harder), but some may not. AMD's coalesce PTE supports a different
> size (128K if I remember correctly). So the multiple of the size
> supported by hardware (64K or 128K) seems like the common ground from
> maximizing hardware benefit point of view. Of course, nothing prevents
> the kernel from allocating other orders.
>
> ARM even supports contiguous PMD, but that would be too big to
> allocate by buddy allocator.

Every time this discussion comes up it seems like MM people have a major
blind spot, where they're only thinking about PTE lookup and TLB
overhead and forgetting every other codepath in the kernel that deals
with cached data - historically one physical page at a time.

By framing the discussion in terms of what's best for the hardware,
you're screwing over all the pure software codepaths. This stupidity has
gone on for long enough with the ridiculous normalpage/hugepage split,
let's not continue it.

Talk to any filesystem person, you don't want to fragment data
unnecessarily. That's effectively what you're advocating for, by
continuing to talk about hardware page sizes.

You need to get away from designing things around hardware limitations
and think in more general terms. The correct answer is "anonymous pages
should be any power of two size".
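
To make "any power of two size" concrete, here is a minimal user-space
sketch, not kernel code. It assumes 4K base pages, and the helper names
order_for_size() and max_order_at() are hypothetical, invented purely for
illustration. The first picks the smallest order that covers a request;
the second picks the largest naturally aligned order that fits in a range,
with no special-casing of 64K, 128K or PMD_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                   /* assumption: 4K base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical helper: smallest order whose size covers `bytes`. */
static unsigned int order_for_size(size_t bytes)
{
	unsigned int order = 0;

	assert(bytes > 0);
	while ((PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

/*
 * Hypothetical helper: largest order (at most `cap`) such that a
 * naturally aligned block of that order starting at `addr` still
 * fits inside [addr, end).
 */
static unsigned int max_order_at(unsigned long addr, unsigned long end,
				 unsigned int cap)
{
	unsigned int order = 0;

	while (order < cap) {
		unsigned long size = PAGE_SIZE << (order + 1);

		/* Stop once the next size is misaligned or overruns. */
		if ((addr & (size - 1)) || addr + size > end)
			break;
		order++;
	}
	return order;
}
```

Under these assumptions a 64K request maps to order 4, and an address
aligned only to 4K gets order 0 regardless of how much room follows it.
Every order in between is treated the same way.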