Re: [PATCH v10 2/5] mseal: add mseal syscall

From: Liam R. Howlett
Date: Tue Apr 16 2024 - 11:01:51 EST


* jeffxu@xxxxxxxxxxxx <jeffxu@xxxxxxxxxxxx> [240415 12:35]:
> From: Jeff Xu <jeffxu@xxxxxxxxxxxx>
>
> The new mseal() is an syscall on 64 bit CPU, and with
> following signature:
>
> int mseal(void addr, size_t len, unsigned long flags)
> addr/len: memory range.
> flags: reserved.
>
> mseal() blocks following operations for the given memory range.
>
> 1> Unmapping, moving to another location, and shrinking the size,
> via munmap() and mremap(), can leave an empty space, therefore can
> be replaced with a VMA with a new set of attributes.
>
> 2> Moving or expanding a different VMA into the current location,
> via mremap().
>
> 3> Modifying a VMA via mmap(MAP_FIXED).
>
> 4> Size expansion, via mremap(), does not appear to pose any specific
> risks to sealed VMAs. It is included anyway because the use case is
> unclear. In any case, users can rely on merging to expand a sealed VMA.
>
> 5> mprotect() and pkey_mprotect().
>
> 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
> memory, when users don't have write permission to the memory. Those
> behaviors can alter region contents by discarding pages, effectively a
> memset(0) for anonymous memory.
>
> Following input during RFC are incooperated into this patch:
>
> Jann Horn: raising awareness and providing valuable insights on the
> destructive madvise operations.
> Linus Torvalds: assisting in defining system call signature and scope.
> Liam R. Howlett: perf optimization.
> Theo de Raadt: sharing the experiences and insight gained from
> implementing mimmutable() in OpenBSD.
>
> Finally, the idea that inspired this patch comes from Stephen Röttger’s
> work in Chrome V8 CFI.

No per-vma change is checked prior to entering a per-vma modification
loop today. This means that mseal() differs in behaviour in "up-front
failure" vs "partial change failure" that exists in every other
function.

I'm not saying it's wrong or that it's right - I'm just wondering what
the direction is here. Either we should do as much up-front as
possible or keep with tradition and have (partial) success where
possible.

If you look at do_mprotect_pkey(), you can even see
map_deny_write_exec() being checked in a loop during modifications.

I think we can all agree that having some up-front and some later
without any reason will lead to a higher probability of things getting
missed.

Thanks,
Liam