Re: [RFC PATCH 00/18] Try to free user PTE page table pages

From: Qi Zheng
Date: Wed May 18 2022 - 23:59:16 EST




On 2022/5/18 10:51 PM, David Hildenbrand wrote:
On 17.05.22 10:30, Qi Zheng wrote:


On 2022/4/29 9:35 PM, Qi Zheng wrote:
Hi,
Different from the idea of my earlier patchset[1], the pte_ref becomes a
struct percpu_ref, and we switch it to atomic mode only in cases such as
MADV_DONTNEED and MADV_FREE that may clear user PTE page table entries,
and then release the user PTE page table page once pte_ref drops to 0.
The advantage of this is that there is basically no performance overhead
in percpu mode, while empty PTE tables can still be freed. In addition,
the implementation of this patchset is much simpler and more portable
than the earlier one[1].
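
Roughly, the zap side of the idea looks like the sketch below. The names
pte_page_ref(), free_user_pte_table() and the exact refcounting rules are
only illustrative here, not the identifiers used in the patchset:

#include <linux/percpu-refcount.h>

/*
 * Sketch only: called from paths like MADV_DONTNEED/MADV_FREE that may
 * empty a user PTE table. Outside of these paths the ref stays in
 * percpu mode, so gets/puts by ordinary page table walkers are cheap.
 */
static void zap_and_try_to_free(struct page *pte_page)
{
	struct percpu_ref *ref = pte_page_ref(pte_page);	/* hypothetical */

	percpu_ref_switch_to_atomic_sync(ref);

	/*
	 * ... clear the PTE entries covered by the madvise() range,
	 * dropping one reference per cleared entry ...
	 */

	if (percpu_ref_is_zero(ref))
		free_user_pte_table(pte_page);			/* hypothetical */
	else
		percpu_ref_switch_to_percpu(ref);
}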

Hi David,

I learned from the LWN article[1] that you led a session at LSFMM on the
problems posed by the lack of page-table reclaim (and thank you very much
for mentioning some of my work in this direction). So I would like to
know: what are the community's further plans for this problem?

Hi,

yes, I talked about the involved challenges, especially how malicious
user space can trigger allocation of almost exclusively page tables and
essentially consume a lot of unmovable+unswappable memory and even store
secrets in the page table structure.

It is indeed difficult to deal with malicious user space programs,
because as long as a single entry in a PTE page table page still maps a
physical page, the entire PTE page cannot be freed.

So maybe we should first solve the problems encountered in engineering
practice. We have run into the problem I mentioned in the cover letter
several times on our servers:

VIRT: 55t
RES: 590g
VmPTE: 110g

They are not malicious programs, they just use jemalloc/tcmalloc normally
(currently jemalloc/tcmalloc often uses mmap+madvise instead of
mmap+munmap to improve performance). We checked and found that most of
this VmPTE consists of empty PTE tables.
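
The pattern is easy to reproduce with a toy program (just an illustration
I am adding here, not something from the cover letter): after the
madvise() the physical pages are gone, but the PTE tables allocated for
the range still show up in VmPTE of /proc/<pid>/status:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1 GiB */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);		/* populate: allocates PTE tables */
	madvise(p, len, MADV_DONTNEED);	/* frees pages, keeps PTE tables */

	getchar();	/* inspect VmPTE in /proc/self/status while waiting */
	return 0;
}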

Of course, normal operation may also lead to consequences similar to
those of malicious programs, but we have not found such cases on our
servers.


Empty PTE tables are one such case we care about, but there is more. Even
with your approach, we can still end up with many page tables that are
allocated on higher levels (e.g., PMD tables) or page tables that are
Yes, currently my patch does not consider PMD tables. The reason is that
their maximum memory consumption is only 1G on a 64-bit system, so the
impact is much smaller than the up to 512G of PTE tables.
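
(Assuming x86-64 with 4K pages: a 4K PTE table maps 2M of virtual address
space, so covering the full 256T range can cost up to
256T / 2M * 4K = 512G of PTE tables, while a 4K PMD table maps 1G, so the
PMD tables for the same range add up to only 256T / 1G * 4K = 1G.)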

not empty (especially, filled with the shared zeropage).

This case is indeed a problem, and a more difficult one. :(


Ideally, we'd have some mechanism that can reclaim also other
reclaimable page tables (e.g., filled with shared zeropage). One idea
was to add reclaimable page tables to the LRU list and to then
scan+reclaim them on demand. There are multiple challenges involved,
obviously. One is how to synchronize against concurrent page table

Agreed. The current situation is that holding the read lock of mmap_lock
ensures that the PTE tables are stable. If we do not use a refcount
method and do not change the locking that protects the PTE tables, then
the write lock of mmap_lock has to be held to ensure synchronization
(which has a huge impact on performance).
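
To make this concrete, a reclaimer today would be forced into something
like the sketch below; pte_table_is_empty() and free_pte_table() are
made-up helper names, and the expensive part is that taking the write
side excludes all concurrent walkers and faults in that mm:

#include <linux/mm.h>

static bool try_to_free_pte_table(struct mm_struct *mm, pmd_t *pmd,
				  unsigned long addr)
{
	bool freed = false;

	/* Walkers only hold the read side, so we must take the write side. */
	if (!mmap_write_trylock(mm))
		return false;

	if (pte_table_is_empty(pmd))			/* hypothetical */
		freed = free_pte_table(mm, pmd, addr);	/* hypothetical */

	mmap_write_unlock(mm);
	return freed;
}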

walkers, another one is how to invalidate MMU notifiers from reclaim
context. It would most probably involve storing required information in
the memmap to be able to lock+synchronize.

This may also be a direction worth exploring.
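
For the MMU notifier part, the invalidation itself could probably be
bracketed roughly like the sketch below; the hard part you mention is
getting from a page table page back to its mm and address from reclaim
context so that this can be issued at all:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void invalidate_and_free_pte_table(struct mm_struct *mm,
					  unsigned long addr)
{
	struct mmu_notifier_range range;

	/* One PTE table covers PMD_SIZE (2M with 4K pages) of VA. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
				addr, addr + PMD_SIZE);
	mmu_notifier_invalidate_range_start(&range);

	/* ... clear the pmd entry, flush the TLB, free the PTE table ... */

	mmu_notifier_invalidate_range_end(&range);
}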


Having that said, adding infrastructure that might not be easy to extend
to the more general case of reclaiming other reclaimable page tables on
multiple levels (esp PMD tables) might not be what we want. OTOH, it
gets the job done for the one case we care about.

It's really hard to tell what to do because reclaiming page tables and
eventually handling malicious user space correctly is far from trivial :)

Yeah, agree :(


I'll be on vacation until the end of May; I'll come back to this mail
once I'm back.


OK, thanks, and have a nice holiday.

--
Thanks,
Qi