Re: [PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

From: Nicholas Piggin
Date: Fri Aug 21 2020 - 12:06:26 EST


Excerpts from Eric Dumazet's message of August 22, 2020 1:38 am:
>
> On 8/21/20 8:12 AM, Nicholas Piggin wrote:
>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>> support PMD-sized vmap mappings.
>>
>> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
>> larger, and fall back to small pages if that was unsuccessful.
>>
>> Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
>> pages, because not all callers expect this (e.g., module allocations vs
>> strict module rwx).
>>
>> This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>>
>> This can result in more internal fragmentation and memory overhead for a
>> given allocation, so an option nohugevmalloc is added to disable it at boot.
>>
>>
>
> Thanks for working on this stuff, I tried something similar in the past,
> but could not really do more than a hack.
> ( https://lkml.org/lkml/2016/12/21/285 )

Oh nice. It might still be possible to use some ideas from your
patch: higher-order pages smaller than PMD size, or the memory
policy stuff, perhaps.

> Note that __init alloc_large_system_hash() is used at boot time,
> when NUMA policy is spreading allocations over all NUMA nodes.
>
> This means that on a dual node system, a hash table should be 50/50 spread.
>
> With your patch, if a hashtable is exactly the size of one huge page,
> the location of this hashtable will not be balanced, which might have some
> unwanted impact.

In that case it shouldn't, because the patch divides the size by the
number of nodes, but in general the balancing will of course have
somewhat coarser granularity than with small pages.
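
As a rough illustration of both points, here is a userspace sketch (not
the patch code). All sketch_* names are hypothetical, and the 4 KiB base
page and 2 MiB PMD sizes are assumptions. It picks huge pages only when
each node's share of the allocation is at least PMD size, and it computes
how many bytes land on each node under a simple round-robin interleave.

```c
#include <assert.h>

/* Assumed sizes for illustration: 4 KiB base pages, 2 MiB PMD pages. */
#define SKETCH_PAGE_SIZE (4096UL)
#define SKETCH_PMD_SIZE  (2UL * 1024 * 1024)

/*
 * Pick a mapping granularity for an allocation spread over nr_nodes.
 * Huge pages are chosen only if each node's share is at least one PMD.
 */
unsigned long sketch_pick_page_size(unsigned long size, unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    unsigned long size_per_node = size / nr_nodes;

    return size_per_node >= SKETCH_PMD_SIZE ? SKETCH_PMD_SIZE
                                            : SKETCH_PAGE_SIZE;
}

/*
 * Bytes that land on `node` when pages of `page_size` are handed out
 * round-robin across nr_nodes, starting at node 0.
 */
unsigned long sketch_bytes_on_node(unsigned long size, unsigned long page_size,
                                   unsigned int nr_nodes, unsigned int node)
{
    assert(nr_nodes > 0 && node < nr_nodes && page_size > 0);
    unsigned long nr_pages = (size + page_size - 1) / page_size;
    unsigned long pages = nr_pages / nr_nodes;

    /* The first (nr_pages % nr_nodes) nodes each get one extra page. */
    if (node < nr_pages % nr_nodes)
        pages++;

    return pages * page_size;
}
```

With these assumptions, a 2 MiB table on two nodes has a 1 MiB share per
node, so it stays on small pages and splits 1 MiB / 1 MiB. A 6 MiB table
mapped with 2 MiB pages splits 4 MiB / 2 MiB, which is the coarser
granularity mentioned above.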

There's probably a better way to size these important hashes on NUMA. I
suspect that most of the time, on a NUMA machine, you would actually
prefer to use large pages now, even if it means taking up to 2MB more
memory per node per hash. That's not a great amount, and the allocation
size is rather arbitrary anyway.
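
One hypothetical way to read the "up to 2MB more per node" figure: if
each node's share of a hash were rounded up to a whole number of
PMD-sized pages, the padding per node would always be less than one PMD.
The sketch below assumes a 2 MiB PMD size and uses made-up sketch_*
names. It is not something the patch does.

```c
#include <assert.h>

/* Assumed PMD size for illustration. */
#define SKETCH_PMD_SIZE (2UL * 1024 * 1024)

/* Round a byte count up to the next multiple of the PMD size. */
unsigned long sketch_round_up_pmd(unsigned long bytes)
{
    return (bytes + SKETCH_PMD_SIZE - 1) / SKETCH_PMD_SIZE * SKETCH_PMD_SIZE;
}

/*
 * Extra bytes each node would carry if its share of `size` were padded
 * out to whole PMD-sized pages. The result is always < SKETCH_PMD_SIZE.
 */
unsigned long sketch_per_node_padding(unsigned long size, unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    /* Each node's share, rounded up so the shares cover the whole table. */
    unsigned long per_node = (size + nr_nodes - 1) / nr_nodes;

    return sketch_round_up_pmd(per_node) - per_node;
}
```

For example, a 5 MiB hash on two nodes gives a 2.5 MiB share per node,
padded to 4 MiB, so 1.5 MiB extra per node. A 4 MiB hash needs no padding.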

Thanks,
Nick