Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

From: Andrea Arcangeli
Date: Tue May 30 2017 - 11:43:38 EST


On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> The sysctl for the mapcount can be increased, right? I also assume that
> those vmas will get merged after the post copy is done.

Assuming you enlarge the sysctl to the worst possible case, with a 64-bit
address space you can have billions of VMAs if you're migrating 4T of
RAM and you're unlucky and the address space gets fragmented. The
unswappable kernel memory overhead would be relatively large
(i.e. dozens of gigabytes of RAM in the vm_area_struct slab), and each
find_vma operation would need to walk ~40 steps across that large vma
rbtree. There's a reason the sysctl exists. Not to mention that all
those unnecessary vma mangling operations would have to run with the
mmap_sem held for writing.
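
To put rough numbers on the above, here is a back-of-the-envelope
sketch. It assumes 4 KiB pages and the worst case of one vma per page;
the helper names are made up for illustration and nothing here is
kernel code. The rbtree bounds are the textbook ones (a perfectly
balanced tree versus the red-black worst case of 2*log2(n+1)), so the
~40 steps quoted above sits between the two.

```c
#include <stdint.h>

/* Worst case assumed here: every page ends up in its own vma. */
uint64_t worst_case_vmas(uint64_t ram_bytes, uint64_t page_bytes)
{
    return ram_bytes / page_bytes;
}

/* floor(log2(n)) for n > 0. */
unsigned int ilog2_u64(uint64_t n)
{
    unsigned int r = 0;

    while (n >>= 1)
        r++;
    return r;
}

/* Levels in a perfectly balanced binary tree holding nr_nodes nodes. */
unsigned int balanced_depth(uint64_t nr_nodes)
{
    return ilog2_u64(nr_nodes) + 1;
}

/* Textbook upper bound on red-black tree height: 2 * log2(n + 1). */
unsigned int rbtree_max_depth(uint64_t nr_nodes)
{
    return 2 * ilog2_u64(nr_nodes + 1);
}
```

With 4T of RAM, worst_case_vmas(4ULL << 40, 4096) gives 2^30 vmas (about
a billion), and a lookup in a tree of that size costs somewhere between
31 and 60 steps under these assumptions.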

Avoiding a ton of vmas, by enabling vma-less pte mangling within a
single large vma while taking mmap_sem only for reading during all the
pte mangling, is one of the primary design motivations for
userfaultfd.

> I understand that part but it sounds awfully one purpose thing to me.
> Are we going to add other MADVISE_RESET_$FOO to clear other flags just
> because we can race in this specific use case?

Those already exist, see for example MADV_NORMAL, which clears
VM_RAND_READ and VM_SEQ_READ after MADV_SEQUENTIAL or MADV_RANDOM has
been called.

Or MADV_DOFORK after MADV_DONTFORK. MADV_DODUMP after MADV_DONTDUMP. Etc.
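
A minimal userspace sketch of those set/reset pairs, assuming Linux
with glibc. The helper name set_then_reset() is made up for
illustration; the MADV_* constants are the real ones from
<sys/mman.h>. It only shows that each hint has a counterpart that
undoes it on the same range.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Apply an madvise() hint to a fresh anonymous mapping, then undo it
 * with its counterpart. Returns 0 if both calls succeed, -1 otherwise.
 */
int set_then_reset(int set_advice, int reset_advice)
{
    const size_t len = 16 * 4096;
    int ret = 0;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    if (madvise(p, len, set_advice) != 0 ||
        madvise(p, len, reset_advice) != 0)
        ret = -1;

    munmap(p, len);
    return ret;
}
```

For example set_then_reset(MADV_SEQUENTIAL, MADV_NORMAL),
set_then_reset(MADV_DONTFORK, MADV_DOFORK) and
set_then_reset(MADV_DONTDUMP, MADV_DODUMP) each exercise one pair.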

> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

MADV_NOHUGEPAGE doesn't mean clearing the flag set with
MADV_HUGEPAGE. MADV_NOHUGEPAGE disables THP on the region if the
global sysfs "enabled" tune is set to "always". MADV_HUGEPAGE enables
THP if the global "enabled" sysfs tune is set to "madvise". Both
MADV_NOHUGEPAGE and MADV_HUGEPAGE are needed to leverage the three-way
setting ("never", "madvise", "always") of the global tune.
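
The interaction can be written down as a small truth table. This is a
simplified model for illustration only, not the kernel's actual code:
the enum and function names are hypothetical, and it ignores prctl and
every other condition that can veto THP on a vma.

```c
/* Global sysfs "enabled" setting. */
enum thp_global { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Per-vma state: no hint given, MADV_HUGEPAGE, or MADV_NOHUGEPAGE. */
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Returns 1 if THP may be used on the vma in this model, 0 otherwise. */
int thp_allowed(enum thp_global global, enum vma_hint hint)
{
    switch (global) {
    case THP_ALWAYS:
        /* On everywhere unless the region opted out. */
        return hint != HINT_NOHUGEPAGE;
    case THP_MADVISE:
        /* Off everywhere unless the region opted in. */
        return hint == HINT_HUGEPAGE;
    case THP_NEVER:
    default:
        return 0;
    }
}
```

HINT_NONE is the state a vma starts in. In this model a vma that has
received either hint follows the global setting again only if it can
get back to HINT_NONE.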

The "madvise" global tune exists for when you want to save RAM and
don't care much about performance, but still want to allow apps like
QEMU, where no memory is lost by enabling THP, to use THP.

There's no way to clear either of those two flags and bring back the
default behavior of the global sysfs tune, so it's not redundant at
the very least.