Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

From: David Hildenbrand
Date: Tue Jan 26 2021 - 06:59:56 EST


On 26.01.21 12:46, Michal Hocko wrote:
> On Thu 21-01-21 14:27:19, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>>
>> Removing a PAGE_SIZE page from the direct map every time such a page is
>> allocated for a secret memory mapping will cause severe fragmentation of
>> the direct map. This fragmentation can be reduced by using PMD-size pages
>> as a pool for small pages for secret memory mappings.
>>
>> Add a gen_pool per secretmem inode and lazily populate this pool with
>> PMD-size pages.
>>
>> As pages allocated by secretmem become unmovable, use CMA to back large
>> page caches so that the page allocator won't be surprised by a failed
>> attempt to migrate these pages.
>>
>> The CMA area used by secretmem is controlled by the "secretmem=" kernel
>> parameter. This allows explicit control over the memory available for
>> secretmem and provides a hard upper limit for secretmem consumption.

> OK, so I have finally had a closer look at this and this is really not
> acceptable. I have already mentioned that in a response to another patch,
> but any task is able to deprive other tasks of access to secret memory
> and trigger the OOM killer, which would never really recover and could
> potentially panic the system. Now you could be less drastic and only raise
> SIGBUS on fault, but that would still be quite terrible. There is a very
> good reason why hugetlb implements its non-trivial reservation system: to
> avoid exactly these problems.
>
> So unless I am really misreading the code
> Nacked-by: Michal Hocko <mhocko@xxxxxxxx>
>
> That doesn't mean I reject the whole idea. There are some details to
> sort out as mentioned elsewhere, but you cannot really depend on a
> pre-allocated pool that can fail at fault time like that.

So, to do it similarly to hugetlbfs (e.g., with CMA), there would have to be a mechanism to actually try pre-reserving (e.g., from the CMA area), at which point the pages would get moved to the secretmem pool, and a mechanism for mmap() etc. to "reserve" from this secretmem pool, such that there are guarantees at fault time?

What we have right now feels like some kind of overcommit (read: like overcommitting huge pages, so we might get SIGBUS at fault time).

TBH, the SIGBUS thingy doesn't sound terrible to me, as long as applications using it expect this behavior right now and can handle it: no guarantees. I fully agree that some kind of reservation/guarantee mechanism would be preferable.
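To make the reservation idea concrete, here is a minimal userspace model of it, not kernel code. Everything in it is hypothetical: struct secretmem_pool, PAGES_PER_PMD, pool_reserve(), pool_fault() and pool_unreserve() are illustrative names, and plain counters stand in for the CMA area and the per-inode gen_pool. It assumes 4 KiB base pages and 2 MiB PMDs (as on x86-64), i.e. 512 small pages per PMD. The point it shows: a shortage is reported at mmap() time, so a fault only ever consumes a page that was already promised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 2 MiB PMD / 4 KiB base page = 512 small pages per PMD. */
#define PAGES_PER_PMD 512UL

/* Hypothetical model of a secretmem pool; counters replace CMA and gen_pool. */
struct secretmem_pool {
	size_t free_pages;     /* small pages sitting in the pool */
	size_t reserved_pages; /* pages promised to mappings; <= free_pages */
	size_t cma_pmds_left;  /* PMD-size chunks still available from CMA */
};

/*
 * mmap() path: promise nr_pages to a new mapping, pulling PMD-size chunks
 * from "CMA" into the pool as needed. Returns false if the promise cannot
 * be kept, so the caller fails mmap() instead of SIGBUS-ing at fault time.
 * On failure, chunks already moved stay in the pool as unreserved pages.
 */
static bool pool_reserve(struct secretmem_pool *p, size_t nr_pages)
{
	size_t unreserved = p->free_pages - p->reserved_pages;

	while (unreserved < nr_pages) {
		if (p->cma_pmds_left == 0)
			return false;
		p->cma_pmds_left--;
		p->free_pages += PAGES_PER_PMD;
		unreserved += PAGES_PER_PMD;
	}
	p->reserved_pages += nr_pages;
	return true;
}

/* Fault path: consume one already-reserved page; cannot fail. */
static void pool_fault(struct secretmem_pool *p)
{
	assert(p->reserved_pages > 0 && p->free_pages > 0);
	p->reserved_pages--;
	p->free_pages--;
}

/* munmap() path: drop reservations for pages that were never faulted in. */
static void pool_unreserve(struct secretmem_pool *p, size_t nr_pages)
{
	assert(nr_pages <= p->reserved_pages);
	p->reserved_pages -= nr_pages;
}
```

For example, with two PMDs left in the CMA area, reserving 600 pages pulls both PMDs (1024 pages) into the pool, a further 424-page reservation succeeds from the remainder, and any additional reservation fails up front rather than at fault time. This mirrors, very loosely, the reserve-at-mmap/consume-at-fault split that hugetlbfs uses; how real secretmem would account per-mapping reservations is left open here.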

--
Thanks,

David / dhildenb